Abstract

We consider nonparametric estimation of probability measures for parameters in problems where only aggregate (population level) data are available. We summarize an existing computational method for the estimation problem which has been developed over the past several decades [4, 8, 16, 21]. Significant new theoretical results are presented which establish the existence and consistency of very general (ordinary, generalized and other) least squares estimates for the measure estimation problem.
1  Motivation


In a standard nonlinear regression problem, a mathematical model is proposed which links one or more states of interest to the independent variables (regressors) of an experiment and to a vector of parameters whose values are unknown to the experimenter. An experiment is then conducted on the physical or biological system and data is collected for one or more states of interest. The unknown parameters of interest are then estimated in an inverse or parameter estimation problem, the theory for which is well-established [15, 28, 31]. Yet in many situations physical, biological, or experimental limitations do not permit one to sample individual data directly. Rather, one obtains data at the aggregate level as multiple individuals are sampled. It is commonly assumed that the states of interest for these individuals are described by a single mathematical framework, but that each individual is described by a unique set of parameters within that framework. For instance, the growth of mosquitofish [9, 16, 17] and shrimp [5, 11, 13] has been shown to be described by a size-structured partial differential equation model in which the rate of individual growth is assumed to vary probabilistically across the population. HIV replication data has been shown to be accurately described by a cellular-level model in which intracellular delays vary from cell to cell [7]. The probabilistic distribution of parameters has also been observed in models of electromagnetic polarization [8, 10, 18, 19] and in the deformation of viscoelastic materials [22]. These examples are considered at greater length in the recent book [21].

In  each  of  these  examples,  one  has  a  mathematical  model  which  describes  the  behavior 
of  an  individual  but  data  which  has  been  sampled  from  an  entire  population  of  individuals. 
Thus,  in  the  context  of  the  mathematical  model,  one  has  information  not  on  the  value  of  a 
fixed,  single  parameter,  but  rather  on  the  distribution  of  parameters  which  characterizes  the 
behavior  of  the  entire  population.  It  is  this  probability  distribution  which  one  seeks  to  estimate. 
Significantly,  the  data  is  sampled  from  the  state  space  of  the  mathematical  system  and  not 
from  the  parameter  space;  thus  one  does  not  sample  directly  from  the  distribution  of  interest.  In 
developing  a  framework  for  this  estimation  problem,  one  encounters  a  rich  body  of  mathematical 
theory.  In  this  document,  we  summarize  a  computational  method  for  the  estimation  problem 
which  has  been  developed  and  tested  computationally  over  the  past  several  decades  [4,  8,  16,  21] 
(Section  5).  In  Section  4  below,  significant  new  results  are  given  which  establish  the  existence 
and  consistency  of  the  least  squares  estimator  for  this  nonparametric  estimation  problem.  First, 
we  formally  define  the  estimation  problem. 

Suppose that the quantities of interest for a single individual can be described by the mathematical model

$$\frac{dy}{dt} = g(t, y(t); q, \psi), \qquad y(t_0) = y_0. \tag{1.1}$$

The parameter vector $q \in \mathbb{R}^r$ is specific to each individual within the population while the real parameter vector $\psi$ describes parameters common to all individuals within the population (e.g., environmental factors). The observation model solution is given by

$$y(t; \theta, \psi) = C f(t; q, y_0, \psi) \tag{1.2}$$

where $\theta = (q, y_0) \in \mathbb{R}^{r+s} = \mathbb{R}^p$. It is assumed $f(t; \theta, \psi) \in \mathbb{R}^s$ and $C \in \mathbb{R}^{l \times s}$ so that $y \in \mathbb{R}^l$. (In the notation that follows, we tacitly assume $l = 1$; this is only for convenience and all theory presented holds for vector observations.) It is assumed that $\theta \in \Theta$ and $\psi \in \Psi$ for all individuals in the population, where $\Theta$ and $\Psi$ are sets of admissible parameters.

For  the  aggregate  data  problem,  one  can  consider  n  observations  as  random  variables  resulting 
from  the  direct  sampling  of  the  mean  population  state,  but  measured  subject  to  random  error. 
Then  it  is  possible  to  define  the  random  variables 

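$$V_j = v(t_j; P_0, \psi_0) + \mathcal{E}_j \tag{1.3}$$

for $j = 1, \dots, n$ where

$$v(t; P, \psi) = \mathbb{E}[C y(t; \cdot, \psi) \mid P] = \int_\Theta C y(t; \theta, \psi)\, dP(\theta),$$

and the random variables $\mathcal{E}_j$ represent measurement noise, modeling error, microfluctuations, etc. Let $\mathcal{E} = (\mathcal{E}_1, \dots, \mathcal{E}_n)$. It is assumed that the first two central moments of the random vector $\mathcal{E}$ are

$$\mathbb{E}[\mathcal{E}] = 0, \qquad \mathrm{Var}[\mathcal{E}] = R. \tag{1.4}$$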

It is most commonly assumed that the random variables $\mathcal{E}_j$ are independent and identically distributed, so that $R = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. In the theory presented in this document, we typically make this assumption of a constant or absolute error model. However, this is not strictly necessary and the results presented can be generalized to include a wide array of statistical models which are encountered in practical problems. Extensions to other error models are considered in Section 6.
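To make the statistical model (1.3) concrete, the following minimal sketch simulates aggregate data under the constant error model. Everything specific in it is an assumption made for illustration and is not taken from the text: the toy individual model (exponential decay, the solution of $dy/dt = -\theta y$, $y(0) = 1$, with $l = 1$ and $C = 1$), the two-point "true" measure, the sampling times, and the function names.

```python
import math
import random

def individual_response(t, theta):
    # Hypothetical individual model: y(t; theta) = exp(-theta * t).
    return math.exp(-theta * t)

def population_mean(t, nodes, weights):
    # v(t; P) for a discrete measure P = sum_k weights[k] * Dirac(nodes[k]):
    # the integral over Theta reduces to a weighted sum.
    return sum(w * individual_response(t, th) for th, w in zip(nodes, weights))

def simulate_aggregate_data(times, nodes, weights, sigma, seed=0):
    # Realizations v_j of V_j = v(t_j; P_0) + E_j with i.i.d. N(0, sigma^2) errors.
    rng = random.Random(seed)
    return [population_mean(t, nodes, weights) + rng.gauss(0.0, sigma)
            for t in times]

times = [0.5 * j for j in range(1, 11)]          # assumed sampling times t_j
true_nodes, true_weights = [0.5, 2.0], [0.3, 0.7]  # assumed "true" measure P_0
data = simulate_aggregate_data(times, true_nodes, true_weights, sigma=0.01)
```

Note that each observation is a noisy sample of the population mean, not of any single individual's trajectory, which is the defining feature of the aggregate data problem.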

Given $n$ realizations $v_j$ of the random variables $V_j$ (which we will sometimes write $v$ and $V$ for notational convenience), the goal of an inverse or parameter estimation problem is to produce an estimate of the hypothetical true parameters $P_0$ and $\psi_0$. Of course, the estimated parameters should be those that best fit the data in some appropriate sense. Thus this problem first involves a choice of framework in which to work. Given that choice of framework, one must establish a set of theoretical and computational tools with which to treat the parameter estimation problem. For the results presented here, we focus on a frequentist approach using general least squares estimation. Theoretical results for likelihood estimation (also in a frequentist framework) can be established with little difficulty from the results presented here. For the moment, we do not consider a Bayesian approach to the estimation of the unknown distribution $P_0$. There does seem to be some commonality between the nonparametric estimation of a probability distribution and the determination of a Bayesian posterior estimator [21]. However, to our knowledge, a comprehensive comparison of the two methods (either theoretical or computational) has not been performed.

We remark here that the estimation of the incidental parameter $\psi$ is not of primary interest in this document. Techniques for the estimation of $\psi$ fall entirely within the theory of classical nonlinear least squares. The parameter is included in the formulation above to provide clear indication that the theory presented below for the nonparametric estimation of a probability distribution is compatible with the simultaneous estimation of an incidental parameter. For instance, in Equation (1.5) below, one can define the estimator $(P_n, \psi_n)$ in $(\mathcal{P}(\Theta), \Psi)$. Without loss of generality, it will be assumed that $\psi$ is known.

For the least squares problem, define the estimator

$$P_n = \arg\min_{P \in \mathcal{P}(\Theta)} J_n(V, P) = \arg\min_{P \in \mathcal{P}(\Theta)} \sum_{j=1}^{n} \left(V_j - v(t_j; P)\right)^2. \tag{1.5}$$

We remark that $P_n$ is itself a random variable in that it is a function of the random variables $V_j$ (and hence $\mathcal{E}_j$). This dependence is generally suppressed with the exception of the subscripted $n$, but should be carefully noted, particularly in the consideration of the existence and consistency of the estimator (Section 4). The inverse problem is then to use realizations $v_j$ of the random variables $V_j$ to compute

$$\hat{P}_n = \arg\min_{P \in \mathcal{P}(\Theta)} J_n(v, P) = \arg\min_{P \in \mathcal{P}(\Theta)} \sum_{j=1}^{n} \left(v_j - v(t_j; P)\right)^2. \tag{1.6}$$

However, one cannot typically compute $\hat{P}_n$ as defined. In most practical problems, the model $v(t; P, \psi)$ cannot be computed exactly and must be approximated with $v^N(t; P, \psi)$ by some numerical scheme (e.g., finite difference methods, Galerkin methods, etc.). Similarly, the space $\mathcal{P}(\Theta)$ has (uncountably) infinitely many elements so that it must also be approximated by some computationally tractable sets $\mathcal{P}_M(\Theta)$. Thus, given a set of realizations $\{v_j\}$ of the random variables $V_j$, what one computes in practice is

$$\hat{P}^N_{n,M} = \arg\min_{P \in \mathcal{P}_M(\Theta)} J_n^N(v, P) = \arg\min_{P \in \mathcal{P}_M(\Theta)} \sum_{j=1}^{n} \left(v_j - v^N(t_j; P)\right)^2. \tag{1.7}$$

The immediate question of interest is how these formal definitions relate back to the actual quantity of interest, the unknown 'true' probability measure $P_0$. In considering this question, we see that several additional questions must be answered. First, it must be shown that the least squares estimator $P_n$ given by (1.5) is well-defined. The next question is computational: as $M$ and $N$ grow large, is it necessarily true that $\hat{P}^N_{n,M}$ converges (in some sense) to $\hat{P}_n$? Of course, the answer to this question depends largely upon the approximation schemes used. For instance, one could define $\mathcal{P}_M(\Theta)$ to be the subset of the space of probability measures consisting of those measures with a specific parametric form. While this technique has the advantage of creating a standard nonlinear estimation problem, it may lead to inaccurate and misleading results unless there is strong evidence to suggest a particular parametric form for the unknown measure. In this document, we are concerned with nonparametric estimation, so that only a minimal set of restrictions is placed on the class of admissible measures.
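As a rough illustration of the computational problem (1.7), the sketch below takes $\mathcal{P}_M(\Theta)$ to be the measures supported on $M$ fixed nodes, so that a candidate measure is just a weight vector on the probability simplex. This choice of approximating set is an assumption made here for illustration only, as are the toy model $y(t;\theta) = e^{-\theta t}$ (evaluated exactly, so no $v^N$ approximation is involved), the node locations, and the brute-force search over a coarse grid of weights; a practical implementation would use a constrained optimizer instead.

```python
import math
from itertools import product

def model_mean(t, nodes, weights):
    # v(t; P) for P = sum_k weights[k] * Dirac(nodes[k]), toy model exp(-theta * t).
    return sum(w * math.exp(-th * t) for th, w in zip(nodes, weights))

def cost(data, times, nodes, weights):
    # Ordinary least squares functional J_n(v, P).
    return sum((v - model_mean(t, nodes, weights)) ** 2
               for v, t in zip(data, times))

def simplex_grid(m, steps):
    # All weight vectors of length m with entries k/steps that sum to 1.
    for ks in product(range(steps + 1), repeat=m):
        if sum(ks) == steps:
            yield [k / steps for k in ks]

def least_squares_measure(data, times, nodes, steps=10):
    # Exhaustive search over the grid; feasible only for very small m.
    return min(simplex_grid(len(nodes), steps),
               key=lambda w: cost(data, times, nodes, w))

times = [0.5 * j for j in range(1, 11)]
nodes = [0.5, 1.0, 2.0]             # assumed node locations in Theta
true_weights = [0.3, 0.0, 0.7]      # assumed "true" measure, lies on the grid
data = [model_mean(t, nodes, true_weights) for t in times]  # noise-free data
best = least_squares_measure(data, times, nodes)
```

With noise-free data generated from a measure that lies on the search grid, the minimizer coincides with the generating weights; with noisy data it would only approximate them.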

The remaining question is statistical. Assuming that $\hat{P}^N_{n,M}$ approaches $\hat{P}_n$ as $M$ and $N$ grow large, how does this estimate compare with $P_0$? Put another way, given any fixed $n$ observations, one obtains an estimate $\hat{P}_n$ of $P_n$. How does this estimate improve as more data is collected (that is, as $n$ grows large)? This is a question of the consistency of the least squares estimator $P_n$.

As will be shown, there is a natural setting, which we will call the Prohorov Metric Framework (PMF), in which these questions can be answered for parameter estimation problems such as these, in which the unknown parameter is a probability distribution. We begin by describing the Prohorov metric on the space of probability measures and derive some properties which will be useful in answering the questions posed above. Under a fairly general set of conditions, it is shown that the estimator $P_n$ is well-defined (in the sense that it is a measurable function which maps the space of data to the space of probability measures). Next, this estimator is shown to be consistent, and conditions for computational approximation and convergence are given. Finally, the statistical model (1.3) is revisited and the results of this document are extended to a larger class of problems.


2  The  Prohorov  Metric 

We begin with several general definitions and theorems which are meant to motivate the PMF and provide some background. No proofs are given for this motivating material, although references are provided. Detailed proofs are provided for the more interesting features of the PMF. A number of the results presented can be found scattered through existing literature. Many of the results of the next two sections (and some alternative proofs) can be found in [25, 33] and have been usefully organized into an easy-to-read series of notes [27].

First,  the  Riesz  Representation  Theorem  on  the  space  of  bounded  continuous  functions  is 
stated.  This  theorem  can  be  used  to  characterize  the  weak*  topology  on  the  continuous  dual  of 
the  space  of  bounded  continuous  functions,  which  provides  an  intuitive  motivation  for  the  weak 
topology  on  the  space  of  probability  measures.  It  is  no  surprise  then  that  the  two  topologies 
are  equivalent  on  the  space  of  probability  measures.  Next  the  Prohorov  metric  is  defined  and  is 
shown  to  metrize  the  weak  topology  of  measures.  The  Prohorov  metric  is  then  used  to  establish 
several  desirable  properties  of  the  space  of  probability  measures. 

Consider the metric space $\Theta$ with its metric $d$, which we can write together as $(\Theta, d)$. Define the space $C_B(\Theta) = \{f : \Theta \to \mathbb{R} \mid f \text{ bounded, continuous}\}$.

Theorem 2.1 (Riesz). Assume $(\Theta, d)$ is a compact (Hausdorff¹) space. For every $f^* \in C_B(\Theta)^*$ (the continuous dual of the space $C_B(\Theta)$), there exists a unique finite signed Borel measure $\mu$ such that

$$f^*(f) = \int_\Theta f(\theta)\, d\mu(\theta)$$

for all $f \in C_B(\Theta)$. Moreover, $\|f^*\| = |\mu|(\Theta)$.

Proof. See [30, pg. 357-358]. □

Given this identification, we may write $f^* = f^*_\mu$ when convenient. We see that the set $\mathcal{P}(\Theta)$ of probability measures on $(\Theta, d)$ can be identified with those $f^*_\mu \in C_B(\Theta)^*$ such that $f^*_\mu(f) \geq 0$ for all nonnegative $f \in C_B(\Theta)$ and $\|f^*_\mu\| = \mu(\Theta) = 1$. Thus we have, in a sense, that $\mathcal{P}(\Theta) \subset C_B(\Theta)^*$. In fact, given any $f \in C_B(\Theta)$, the map from $C_B(\Theta)^*$ into $\mathbb{R}$ given by $f^{**}(f^*) = f^*(f)$ defines the natural embedding of $C_B(\Theta) \hookrightarrow C_B(\Theta)^{**}$. The image of this embedding induces a topology on the space $C_B(\Theta)^*$, known to functional analysts as the weak* topology [2, 49-57]. (That is, $f^*_k \xrightarrow{w^*} f^*$ if and only if $f^*_k(f) \to f^*(f)$ for all $f \in C_B(\Theta)$.) When viewed in the context of $\mathcal{P}(\Theta) \subset C_B(\Theta)^*$, this is the weak convergence of measures known from the theory of probability and stochastic processes.

¹ The assumption that $\Theta$ is Hausdorff will be maintained throughout this document.

With  this  motivation,  we  now  turn  to  the  problem  of  characterizing  the  weak  topology  of 
measures. 


Definition 2.2. Let $(\Theta, d)$ be any metric space (not necessarily compact) and define the set $C_B(\Theta)$ as above. Given any probability measure $P \in \mathcal{P}(\Theta)$ and some $\epsilon > 0$, an $\epsilon$-neighborhood of $P$ is

$$B_\epsilon(P) = \left\{ Q \;\middle|\; \left| \int_\Theta f(\theta)\, dQ(\theta) - \int_\Theta f(\theta)\, dP(\theta) \right| < \epsilon, \text{ for all } f \in C_B(\Theta) \right\}. \tag{2.1}$$

Comparing the Riesz Representation Theorem (Theorem 2.1) with the definition of $B_\epsilon(P)$, there is a clear connection between the open balls on $\mathcal{P}(\Theta)$ and the weak topology of measures. In fact, we may take the collection of all open balls as the definition of the weak topology of measures [25, pg. 236]. Alternatively, we have the following equivalent characterizations of the weak topology.


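Theorem 2.3. Let $\Theta$ be a topological space with $\sigma$-algebra $\Sigma_\Theta$. Let $P \in \mathcal{P}(\Theta)$. The following are equivalent:

1. $B_\epsilon(P)$;

2. $\{Q \mid Q(C) < P(C) + \epsilon,\ C \subset \Theta \text{ closed}\}$;

3. $\{Q \mid P(O) < Q(O) + \epsilon,\ O \subset \Theta \text{ open}\}$;

4. $\{Q \mid Q(F) < P(F) + \epsilon,\ F \in \Sigma_\Theta,\ P(\partial F) = 0\}$ (such sets are called $P$-continuity sets).

Proof. See [25, pgs. 236-237]. □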

The  weak  topology  of  measures,  in  turn,  gives  rise  to  notions  of  weak  (topological)  convergence 
of  measures. 


Definition 2.4. Given a sequence of measures $P_M \in \mathcal{P}(\Theta)$ for all $M = 1, \dots, \infty$, we say $P_M$ converges weakly to $P$, $P_M \xrightarrow{w^*} P$, if any one (and hence all) of the following equivalent conditions holds:

1. $\left| \int_\Theta f(\theta)\, dP_M(\theta) - \int_\Theta f(\theta)\, dP(\theta) \right| \to 0$ for all $f \in C_B(\Theta)$;

2. $\limsup P_M(C) \leq P(C)$ for all $C$ closed in $\Theta$;

3. $\liminf P_M(O) \geq P(O)$ for all $O$ open in $\Theta$;

4. $\lim P_M(F) = P(F)$ for all sets $F \in \Sigma_\Theta$ such that $F$ is a $P$-continuity set.
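The conditions above can be checked by hand in a simple case. The sketch below uses an example chosen here for illustration (it does not appear in the text): the Dirac measures $P_M$ concentrated at $1/M$ on the real line, which converge weakly to the Dirac measure at $0$. Integrals against a bounded continuous function converge, while the mass assigned to the closed set $\{0\}$ does not; this is consistent with condition 2 (an inequality, not an equality) and with condition 4, because $\{0\}$ is not a continuity set of the limit.

```python
import math

def integrate_against_dirac(f, theta):
    # The integral of f with respect to the Dirac measure at theta is f(theta).
    return f(theta)

def dirac_mass(theta, point_set):
    # Mass that the Dirac measure at theta assigns to a finite set of points.
    return 1.0 if theta in point_set else 0.0

f = math.atan                      # a bounded continuous test function on R
indices = (1, 10, 100, 1000)

# Condition 1: |int f dP_M - int f dP| -> 0 with P the Dirac measure at 0.
gaps = [abs(integrate_against_dirac(f, 1.0 / M) - integrate_against_dirac(f, 0.0))
        for M in indices]

# Condition 2 on the closed set C = {0}: P_M(C) = 0 for every M, while P(C) = 1.
closed_set = {0.0}
masses = [dirac_mass(1.0 / M, closed_set) for M in indices]
limit_mass = dirac_mass(0.0, closed_set)
```

Here `gaps` decreases toward zero while `masses` stays at zero, strictly below `limit_mass`.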
The equivalence of the above notions of convergence is often referred to as the portmanteau theorem [25, pgs. 11-12]. We remark that the notation $P_M \xrightarrow{w^*} P$ is slightly abusive as it implies weak* convergence when what is meant is the weak convergence of measures. Yet it should be emphasized that the two notions are equivalent on the space of probability measures.

The above definitions and theorem provide several characterizations of the weak* topology on the set of probability measures. While this characterization is mathematically sufficient, our discussions of approximation and convergence would be facilitated by some metric $\rho$ defined on the space $\mathcal{P}(\Theta)$ which metrizes the above notions of topological convergence. That is, given two probability measures $P$ and $Q$, we would like $\rho$ to have the property that $Q \in B_\epsilon(P)$ if and only if $\rho(P, Q) < \epsilon$. Such a metric could then be used to establish more intuitive notions of convergence, compactness, etc., in the space of probability measures. In fact, such a metric does exist, named for the Russian probabilist Y.V. Prohorov who first defined the metric [29] and derived its properties.

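Definition 2.5. Let $(\Theta, d)$ be a metric space. For all $F \in \Sigma_\Theta$, $F \neq \emptyset$, define the $\epsilon$-neighborhood of $F$,

$$F^\epsilon = \left\{ \phi \in \Theta \;\middle|\; \inf_{\theta \in F} d(\theta, \phi) < \epsilon \right\}.$$

If $F = \emptyset$, define $F^\epsilon = \emptyset$.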

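Definition 2.6. Let $(\Theta, d)$ be a metric space and let $\mathcal{P}(\Theta)$ be the set of all probability measures on $\Theta$. For any two measures $P, Q \in \mathcal{P}(\Theta)$, the Prohorov metric $\rho$ is

$$\rho(P, Q) = \inf \left\{ \epsilon > 0 \mid Q(F) \leq P(F^\epsilon) + \epsilon \text{ and } P(F) \leq Q(F^\epsilon) + \epsilon, \text{ for all } F \in \Sigma_\Theta \right\}.$$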
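Since the definition is not very intuitive, a small numerical sketch may help. It evaluates the definition by brute force when $\Theta$ is a finite set of points, so that every subset is measurable and can be enumerated. The function name, the bisection tolerance, and the restriction to a finite $\Theta$ are all assumptions made here for illustration; the approach scales exponentially in the number of points and is not a practical algorithm.

```python
from itertools import combinations

def prohorov_finite(points, p, q, dist, tol=1e-9):
    """Approximate the Prohorov distance between probability vectors p and q
    on the finite metric space `points` (brute force over all nonempty subsets)."""
    n = len(points)
    subsets = [c for r in range(1, n + 1) for c in combinations(range(n), r)]

    def admissible(eps):
        # Check Q(F) <= P(F^eps) + eps and P(F) <= Q(F^eps) + eps for all F.
        for F in subsets:
            # F^eps: indices of points at distance < eps from some point of F.
            F_eps = [i for i in range(n)
                     if min(dist(points[j], points[i]) for j in F) < eps]
            p_F = sum(p[j] for j in F)
            q_F = sum(q[j] for j in F)
            p_Fe = sum(p[i] for i in F_eps)
            q_Fe = sum(q[i] for i in F_eps)
            slack = 1e-12  # guards against floating-point round-off
            if q_F > p_Fe + eps + slack or p_F > q_Fe + eps + slack:
                return False
        return True

    # Admissibility is monotone in eps and eps = 1 is always admissible,
    # so the infimum can be located by bisection on (0, 1].
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if admissible(mid):
            hi = mid
        else:
            lo = mid
    return hi

def abs_dist(a, b):
    return abs(a - b)

rho_close = prohorov_finite([0.0, 0.3], [1.0, 0.0], [0.0, 1.0], abs_dist)
rho_far = prohorov_finite([0.0, 2.0], [1.0, 0.0], [0.0, 1.0], abs_dist)
rho_same = prohorov_finite([0.0, 0.3], [0.5, 0.5], [0.5, 0.5], abs_dist)
```

For two point masses the computed value is (up to the tolerance) the distance between the points capped at one, and for identical measures it is zero up to the tolerance.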

This definition of the Prohorov metric is far from intuitive. We will first prove that Definition 2.6 does indeed describe a valid metric. Next, we show that $\rho$ metrizes the weak* topology.

Theorem 2.7. Let $(\Theta, d)$ be a separable metric space. Then $\rho$ is a metric on $\mathcal{P}(\Theta)$.

Proof. By construction, $\rho$ is nonnegative and symmetric and $\rho(P, Q) = 0$ if $P = Q$. We must show $\rho(P, Q) = 0$ implies $P = Q$, and that $\rho$ is subadditive.

Assume $\rho(P, Q) = 0$. Then $P(F) = Q(F)$ for all $F \in \Sigma_\Theta$ (and, in particular, for all closed sets in $\Theta$). Since $(\Theta, d)$ is separable, all probability measures on $\Theta$ are regular [25, pg. 7], and thus are uniquely determined by their values on closed sets. Thus we may conclude $P = Q$.

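To show subadditivity, assume $\rho(P_1, P_2) = \epsilon_1$ and $\rho(P_2, P_3) = \epsilon_2$. We need to show $\rho(P_1, P_3) \leq \epsilon_1 + \epsilon_2$. From the definition of $\rho$, the following inequalities hold for all $F \in \Sigma_\Theta$:

$$\begin{aligned}
P_1(F) &\leq P_2(F^{\epsilon_1}) + \epsilon_1 \\
P_2(F) &\leq P_1(F^{\epsilon_1}) + \epsilon_1 \\
P_2(F) &\leq P_3(F^{\epsilon_2}) + \epsilon_2 \\
P_3(F) &\leq P_2(F^{\epsilon_2}) + \epsilon_2.
\end{aligned}$$

Thus we have

$$\begin{aligned}
P_1(F) &\leq P_2(F^{\epsilon_1}) + \epsilon_1 \leq P_3\left((F^{\epsilon_1})^{\epsilon_2}\right) + \epsilon_1 + \epsilon_2 \\
P_3(F) &\leq P_2(F^{\epsilon_2}) + \epsilon_2 \leq P_1\left((F^{\epsilon_2})^{\epsilon_1}\right) + \epsilon_2 + \epsilon_1.
\end{aligned}$$

Trivially, $(F^{\epsilon_1})^{\epsilon_2} \subset F^{\epsilon_1 + \epsilon_2}$. Hence $P_1(F) \leq P_3(F^{\epsilon_1 + \epsilon_2}) + \epsilon_1 + \epsilon_2$ and $P_3(F) \leq P_1(F^{\epsilon_1 + \epsilon_2}) + \epsilon_1 + \epsilon_2$. Since these statements hold for all $F \in \Sigma_\Theta$, $\rho(P_1, P_3) \leq \epsilon_1 + \epsilon_2$. □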
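The inclusion described as trivial in the proof above is a direct consequence of the triangle inequality. A short derivation, written out here for completeness using only Definition 2.5 (the intermediate point $\psi$ is notation introduced for this sketch), reads:

```latex
\begin{aligned}
\phi \in (F^{\epsilon_1})^{\epsilon_2}
  &\;\Longrightarrow\; \exists\, \psi \in F^{\epsilon_1} \text{ with } d(\psi, \phi) < \epsilon_2
     && \text{(definition of } (\cdot)^{\epsilon_2}\text{)} \\
  &\;\Longrightarrow\; \exists\, \theta \in F \text{ with } d(\theta, \psi) < \epsilon_1
     && \text{(since } \psi \in F^{\epsilon_1}\text{)} \\
  &\;\Longrightarrow\; d(\theta, \phi) \leq d(\theta, \psi) + d(\psi, \phi) < \epsilon_1 + \epsilon_2
     && \text{(triangle inequality)} \\
  &\;\Longrightarrow\; \phi \in F^{\epsilon_1 + \epsilon_2}.
\end{aligned}
```

The same argument with the roles of $\epsilon_1$ and $\epsilon_2$ exchanged gives $(F^{\epsilon_2})^{\epsilon_1} \subset F^{\epsilon_1 + \epsilon_2}$.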

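Theorem 2.8. Assume $(\Theta, d)$ is separable. Assume $P_M \in \mathcal{P}(\Theta)$ for all $M = 1, \dots, \infty$, and $P \in \mathcal{P}(\Theta)$. Then $P_M \xrightarrow{w^*} P$ if and only if $\rho(P_M, P) \to 0$.

Proof. ($\Leftarrow$) Assume $\rho(P_M, P) \to 0$. Then for all $\epsilon > 0$ there exists $\bar{M} = \bar{M}(\epsilon)$ such that

$$P_M(F) \leq P(F^\epsilon) + \epsilon \quad \text{and} \quad P(F) \leq P_M(F^\epsilon) + \epsilon$$

for all $M \geq \bar{M}$ and for all $F \in \Sigma_\Theta$. Let $C$ be any closed set in $\Theta$. (Then $C \in \Sigma_\Theta$.) Since $(\Theta, d)$ is separable, $P$ is regular and there exists $\delta < \epsilon$ such that

$$P(C^\delta) < P(C) + \frac{\epsilon}{2}.$$

Take $\bar{M} = \bar{M}(\delta/2)$. Then $\rho(P_M, P) < \delta/2$ for all $M \geq \bar{M}$ and

$$\begin{aligned}
P_M(C) &\leq P(C^{\delta/2}) + \frac{\delta}{2} && \text{(defn of } \rho\text{)} \\
&\leq P(C) + \frac{\epsilon}{2} + \frac{\delta}{2} && \text{(regularity of } P\text{)} \\
&< P(C) + \epsilon && \text{(choice of } \delta\text{).}
\end{aligned}$$

Hence $\limsup P_M(C) \leq P(C)$ for all $C$ closed in $\Theta$ and $P_M \xrightarrow{w^*} P$ by Definition 2.4.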

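($\Rightarrow$) Assume $P_M \xrightarrow{w^*} P$. For all $\epsilon > 0$, fix $\delta$ such that $0 < \delta < \epsilon/3$. By the separability of $\Theta$, there exist open sets $B_\delta(\theta_k)$ such that

$$\bigcup_{k=1}^{\infty} B_\delta(\theta_k) = \Theta.$$

Fix $n_0$ such that

$$P\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) > 1 - \delta. \tag{2.2}$$

(Such a value $n_0$ must exist since $\lim_{n \to \infty} P\left( \bigcup_{k=1}^{n} B_\delta(\theta_k) \right) = 1$.) Define the collection of all possible (nonempty) unions of the sets $B_\delta(\theta_k)$, $1 \leq k \leq n_0$,

$$\mathcal{O} = \left\{ \bigcup_{k \in I} B_\delta(\theta_k) \;\middle|\; \emptyset \neq I \subset \{1, \dots, n_0\} \right\}.$$

Then for all $A \in \mathcal{O}$, $A \in \Sigma_\Theta$ and $\partial A \subset \bigcup_{k=1}^{n_0} \partial B_\delta(\theta_k)$ so that $P(\partial A) = 0$. Then $A$ is a $P$-continuity set and $P_M(A) \to P(A)$ by assumption. Thus there exists $\bar{M}$ such that

$$|P_M(A) - P(A)| < \delta \tag{2.3}$$

for all $M \geq \bar{M}$ and for all $A \in \mathcal{O}$. In particular, for $A = \bigcup_{k=1}^{n_0} B_\delta(\theta_k)$,

$$\begin{aligned}
P_M\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) &> P\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) - \delta && \text{(by (2.3))} \\
&> 1 - 2\delta && \text{(by (2.2)).}
\end{aligned} \tag{2.4}$$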


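Now, we need to show $P_M(F) \leq P(F^\epsilon) + \epsilon$ and $P(F) \leq P_M(F^\epsilon) + \epsilon$ for all $F \in \Sigma_\Theta$. Let $F \in \Sigma_\Theta$ be arbitrary. Define

$$A = \bigcup \left\{ B_\delta(\theta_k) \mid B_\delta(\theta_k) \cap F \neq \emptyset \right\}$$

where the union is taken over $1 \leq k \leq n_0$. The following facts are trivially verified:

$$A \in \mathcal{O} \tag{2.5}$$

$$A \subset F^\delta \tag{2.6}$$

$$F \subset A \cup \left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right)^c. \tag{2.7}$$

Then for all $M \geq \bar{M}$,

$$\begin{aligned}
P(F) &\leq P(A) + P\left( \left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right)^c \right) && \text{(by (2.7))} \\
&\leq P(A) + \delta && \text{(by (2.2))} \\
&\leq P_M(A) + 2\delta && \text{(by (2.3))} \\
&\leq P_M(F^\delta) + 2\delta && \text{(by (2.6))} \\
&\leq P_M(F^\epsilon) + \epsilon && \text{(by choice of } \delta\text{)}
\end{aligned}$$

and

$$\begin{aligned}
P_M(F) &\leq P_M(A) + P_M\left( \left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right)^c \right) && \text{(by (2.7))} \\
&\leq P_M(A) + 2\delta && \text{(by (2.4))} \\
&\leq P(A) + 3\delta && \text{(by (2.3))} \\
&\leq P(F^\delta) + 3\delta && \text{(by (2.6))} \\
&\leq P(F^\epsilon) + \epsilon && \text{(by choice of } \delta\text{).}
\end{aligned}$$

Hence $\rho(P_M, P) \leq \epsilon$ for all $M \geq \bar{M}$. □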


With these considerations, we have obtained the desired result: the weak topology of measures (weak* topology) is equivalent to the topology induced by the Prohorov metric on the space of probability measures over a separable metric space $(\Theta, d)$. It should be noted that in the definition of the Prohorov metric, it is sufficient to consider only sets $F$ which are closed (see [33, Online Supplement] for a proof; this follows from the fact that probability measures are regular [25, pg. 7]), so that the definitions and results presented here are in agreement with similar results obtained previously [4, 6, 11, 16, 21]. We now proceed to use the Prohorov metric to establish a list of propositions, theorems and corollaries which will be of use as we return to the original problem of setting up a least-squares estimation framework for the nonparametric estimation of measures.


3  Some  Useful  Results 


From the results of the previous section, we know that given a separable metric space $(\Theta, d)$, the space $(\mathcal{P}(\Theta), \rho)$ of probability measures on $\Theta$ is a metric space with a topology equivalent to the weak topology of measures (weak* topology). We now focus on characterising the properties of the space $(\mathcal{P}(\Theta), \rho)$ which will prove useful in establishing results for the parameter estimation problem.

Define

$$D = \{ \Delta_\theta \mid \theta \in \Theta \}.$$

That is, $D$ is the space of Dirac measures on $\Theta$, defined for all $F \in \Sigma_\Theta$ as

$$\Delta_{\theta_k}(F) = \begin{cases} 1, & \theta_k \in F \\ 0, & \theta_k \notin F. \end{cases}$$

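Proposition 3.1. Let $(\Theta, d)$ be a separable metric space and define $D \subset \mathcal{P}(\Theta)$ as above. Then

$$\rho(\Delta_{\theta_1}, \Delta_{\theta_2}) = \min\{ d(\theta_1, \theta_2), 1 \}.$$

Proof. Note first that $\rho(P, Q) \leq 1$ for all $P, Q \in \mathcal{P}(\Theta)$. Take $\epsilon > d(\theta_1, \theta_2)$. Then for all $F \in \Sigma_\Theta$, $\theta_1 \in F \Rightarrow \theta_2 \in F^\epsilon$ and $\theta_2 \in F \Rightarrow \theta_1 \in F^\epsilon$. Thus for all $F \in \Sigma_\Theta$,

$$\begin{aligned}
\Delta_{\theta_1}(F) &\leq \Delta_{\theta_2}(F^\epsilon) + \epsilon \\
\Delta_{\theta_2}(F) &\leq \Delta_{\theta_1}(F^\epsilon) + \epsilon.
\end{aligned}$$

Thus $\rho(\Delta_{\theta_1}, \Delta_{\theta_2}) \leq \epsilon$. Since this holds for all $\epsilon > d(\theta_1, \theta_2)$, we have $\rho(\Delta_{\theta_1}, \Delta_{\theta_2}) \leq d(\theta_1, \theta_2)$.

Now take $\epsilon$ such that $\rho(\Delta_{\theta_1}, \Delta_{\theta_2}) < \epsilon < 1$. Then

$$\begin{aligned}
\Delta_{\theta_1}(F) &\leq \Delta_{\theta_2}(F^\epsilon) + \epsilon \\
\Delta_{\theta_2}(F) &\leq \Delta_{\theta_1}(F^\epsilon) + \epsilon
\end{aligned}$$

for all $F \in \Sigma_\Theta$. Take $F = \{\theta_1\}$. Then the first inequality above implies $1 \leq \Delta_{\theta_2}(B_\epsilon(\theta_1)) + \epsilon$. Since $\epsilon < 1$, we must have $\Delta_{\theta_2}(B_\epsilon(\theta_1)) > 0$ and thus $\theta_2 \in B_\epsilon(\theta_1)$. Hence $d(\theta_1, \theta_2) < \epsilon$. Since this holds for all $\rho(\Delta_{\theta_1}, \Delta_{\theta_2}) < \epsilon < 1$, we must have $\min\{d(\theta_1, \theta_2), 1\} \leq \rho(\Delta_{\theta_1}, \Delta_{\theta_2})$. Hence the stated result holds. □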
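As a quick check of the formula in Proposition 3.1, consider an example chosen here for illustration: $\Theta = \mathbb{R}$ with $d(\theta_1, \theta_2) = |\theta_1 - \theta_2|$, and the test set $F = \{0\}$.

```latex
\begin{aligned}
&\theta_1 = 0,\ \theta_2 = 0.3: &&
  \Delta_0(\{0\}) = 1 \leq \Delta_{0.3}\big(\{0\}^{\epsilon}\big) + \epsilon
  \ \text{ holds iff } \ \epsilon > 0.3,
  \ \text{ so } \ \rho(\Delta_0, \Delta_{0.3}) = 0.3 = \min\{0.3, 1\}; \\
&\theta_1 = 0,\ \theta_2 = 5: &&
  \Delta_0(\{0\}) = 1 \leq \Delta_{5}\big(\{0\}^{\epsilon}\big) + \epsilon
  \ \text{ fails for every } \ \epsilon < 1,
  \ \text{ so } \ \rho(\Delta_0, \Delta_{5}) = 1 = \min\{5, 1\}.
\end{aligned}
```

The cap at one reflects the fact that for $\epsilon \geq 1$ both inequalities in Definition 2.6 hold automatically, since probabilities never exceed one.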


Corollary 3.2. The sequence $\{\theta_k\}_{k=1}^{\infty}$ is Cauchy in the separable space $(\Theta, d)$ if and only if the sequence $\{\Delta_{\theta_k}\}_{k=1}^{\infty}$ is Cauchy in $(\mathcal{P}(\Theta), \rho)$.

Proof. Trivial, by previous proposition. □


Corollary 3.3. Let $(\Theta, d)$ be a separable metric space and let the space $D$ be defined as above. Then $D$ is sequentially closed in $(\mathcal{P}(\Theta), \rho)$. (That is, $D$ is weak* sequentially closed in the space of probability measures.)

Proof. Assume the sequence $\{\Delta_{\theta_k}\}_{k=1}^{\infty}$ converges in the Prohorov metric to some $P \in \mathcal{P}(\Theta)$. We need to show $P \in D$. An obvious candidate is $P = \Delta_\theta$ where $\theta = \lim \theta_k$, if such a limit were to exist. We show that this is the case.

Consider the sequence $\{\theta_k\}_{k=1}^{\infty}$ and assume (for the purpose of reaching a contradiction) that this sequence does not have a convergent subsequence. (Then any element in the sequence $\{\theta_k\}_{k=1}^{\infty}$ can be repeated at most only a finite number of times, and we may assume without loss of generality that no element of the sequence is repeated.) Define the set $S = \{\theta_1, \theta_2, \dots\}$. Then $S$ is (vacuously) closed in $\Theta$, as is every subset of $S$. Now consider any subsequence $C = \{\theta_{k_1}, \theta_{k_2}, \dots\} \subset S$. Then by the weak convergence of the set $\{\Delta_{\theta_k}\}_{k=1}^{\infty}$, and because $C$ cannot contain a convergent subsequence,

$$P(C) \geq \limsup \Delta_{\theta_k}(C) = 1.$$

However, define $C_k$ to be the subset obtained by removing the element $\theta_k$ from $S$. (If $\theta_k$ were repeated $n_k$ times, one obtains the same result by removing all instances of $\theta_k$.) Then $P(C_k) = 1$ for all $k$ (by the argument above), and hence $P(\{\theta_k\}) = 0$ for all $k$. But $S$ is the disjoint countable union of the point sets $\{\theta_k\}$, hence we would have $P(S) = 0$. But $P(S) = 1$ since $S$ is itself a closed set. Thus we have reached a contradiction.

So the sequence $\{\theta_k\}_{k=1}^{\infty}$ must have a convergent subsequence, $\theta_{k_j} \to \theta$. But then we must have $\Delta_{\theta_{k_j}} \to \Delta_\theta$ by Corollary 3.2. Hence $P = \Delta_\theta$ by the uniqueness of weak* limits. □

Definition 3.4. $P \in \mathcal{P}(\Theta)$ is tight if for all $\epsilon > 0$ there exists a compact set $K \subset \Theta$ such that $P(K) > 1 - \epsilon$. A family of measures $\Pi \subset \mathcal{P}(\Theta)$ is tight if for all $P \in \Pi$, $P$ is tight.

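Theorem 3.5. Assume $(\Theta, d)$ is complete. If for all $\epsilon, \delta > 0$ there exist $\theta_1, \dots, \theta_M \in \Theta$ such that

$$P\left( \bigcup_{k=1}^{M} B_\delta(\theta_k) \right) \geq 1 - \epsilon$$

for all $P \in \Pi$, then $\Pi$ is tight.

Proof. For all $\epsilon > 0$, for each $n \geq 1$, take $\delta = 1/n$. By hypothesis, there exist $\theta^n_1, \dots, \theta^n_{M_n} \in \Theta$ such that

$$P\left( \bigcup_{k=1}^{M_n} B_{1/n}(\theta^n_k) \right) \geq 1 - 2^{-n} \epsilon$$

for all $P \in \Pi$. Define

$$K = \bigcap_{n=1}^{\infty} \bigcup_{k=1}^{M_n} \overline{B_{1/n}(\theta^n_k)}.$$

Then $K$ is closed and for $n > 1/\delta$,

$$K \subset \bigcup_{k=1}^{M_n} \overline{B_{1/n}(\theta^n_k)} \subset \bigcup_{k=1}^{M_n} B_\delta(\theta^n_k).$$

Thus $K$ is a totally bounded subset of a complete space. Thus $K$ is compact. Moreover, for any $P \in \Pi$,

$$\begin{aligned}
P(K) &= 1 - P\left( \bigcup_{n=1}^{\infty} \left( \bigcup_{k=1}^{M_n} \overline{B_{1/n}(\theta^n_k)} \right)^c \right) \\
&= 1 - \lim_{N \to \infty} P\left( \bigcup_{n=1}^{N} \left( \bigcup_{k=1}^{M_n} \overline{B_{1/n}(\theta^n_k)} \right)^c \right) \\
&\geq 1 - \lim_{N \to \infty} \sum_{n=1}^{N} P\left( \left( \bigcup_{k=1}^{M_n} B_{1/n}(\theta^n_k) \right)^c \right) \\
&\geq 1 - \sum_{n=1}^{\infty} 2^{-n} \epsilon \\
&= 1 - \epsilon.
\end{aligned}$$

□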

Theorem 3.6 (Prohorov). Assume $(\Theta, d)$ is separable and let $\Pi \subset (\mathcal{P}(\Theta), \rho)$. The following are equivalent:

• $\Pi$ is relatively (sequentially) compact;

• $\Pi$ is tight.

Proof. See [25, Ch. 1.6]. □

Corollary 3.7. Assume $(\Theta, d)$ is separable. Then $(\Theta, d)$ is complete if and only if $(\mathcal{P}(\Theta), \rho)$ is complete.

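Proof. ($\Rightarrow$) Assume $\{P_M\}_{M=1}^{\infty}$ is a weak* Cauchy sequence in $(\mathcal{P}(\Theta), \rho)$. We need to show there exists some $P \in \mathcal{P}(\Theta)$ such that $P_M \xrightarrow{w^*} P$. To do so, it is sufficient to show that $\{P_M\}_{M=1}^{\infty}$ has at least one convergent subsequence. If we can show that the collection of measures is tight, then it is weak* relatively sequentially compact in $\mathcal{P}(\Theta)$ by Prohorov's Theorem (Theorem 3.6). To prove the tightness of this collection of measures, we will use Theorem 3.5.

Let $\{\theta_1, \theta_2, \dots\}$ be an enumeration of the countable, dense subset of $\Theta$. For all $\epsilon, \delta > 0$, fix $\eta < \min\{\epsilon, \delta\}/2$. Since the sequence $\{P_M\}_{M=1}^{\infty}$ is weak* Cauchy, there exists $\bar{M} = \bar{M}(\eta)$ such that $P_M(F) \leq P_N(F^\eta) + \eta$ and $P_N(F) \leq P_M(F^\eta) + \eta$ for all $M, N \geq \bar{M}$ and for all $F \in \Sigma_\Theta$. Note that, by construction,

$$\bigcup_{k=1}^{\infty} B_{\delta/2}(\theta_k) = \Theta.$$

Hence for each $1 \leq M \leq \bar{M}$,

$$\lim_{n \to \infty} P_M\left( \bigcup_{k=1}^{n} B_{\delta/2}(\theta_k) \right) = 1$$

and there exists an $n_0$ such that

$$P_M\left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right) \geq 1 - \eta. \tag{3.1}$$

(Such an $n_0$ must exist separately for each value of $M$, of which there are a finite number.) Now note that

$$\left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right)^{\eta} \subset \bigcup_{k=1}^{n_0} B_{\delta/2 + \eta}(\theta_k) \subset \bigcup_{k=1}^{n_0} B_\delta(\theta_k). \tag{3.2}$$

Hence for all $M \geq \bar{M}$,

$$\begin{aligned}
P_{\bar{M}}\left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right) &\leq P_M\left( \left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right)^{\eta} \right) + \eta && \text{(by defn of } \bar{M}\text{)} \\
&\leq P_M\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) + \eta && \text{(by (3.2))}
\end{aligned}$$

and therefore

$$\begin{aligned}
P_M\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) &\geq P_{\bar{M}}\left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right) - \eta \\
&\geq 1 - 2\eta && \text{(by (3.1))} \\
&> 1 - \epsilon && \text{(by choice of } \eta\text{).}
\end{aligned}$$

Finally, for $1 \leq M \leq \bar{M}$,

$$\begin{aligned}
P_M\left( \bigcup_{k=1}^{n_0} B_\delta(\theta_k) \right) &\geq P_M\left( \bigcup_{k=1}^{n_0} B_{\delta/2}(\theta_k) \right) \\
&\geq 1 - \eta && \text{(by (3.1))} \\
&> 1 - \epsilon && \text{(by choice of } \eta\text{).}
\end{aligned}$$

So $P_M$ is tight for all $M = 1, \dots, \infty$. Hence $\{P_M\}_{M=1}^{\infty}$ is tight and thus relatively compact in $\mathcal{P}(\Theta)$ and so has a convergent subsequence to some $P \in \mathcal{P}(\Theta)$.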

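($\Leftarrow$) Assume $\{\theta_k\}_{k=1}^{\infty}$ is Cauchy in $(\Theta, d)$. We need to show $\theta_k \to \theta$ for some $\theta \in \Theta$. Since
$\{\theta_k\}_{k=1}^{\infty}$ is Cauchy, the sequence of Dirac measures $\{\Delta_{\theta_k}\}_{k=1}^{\infty}$ is also Cauchy by Corollary 3.2.

By the completeness of $(\mathcal{P}(\Theta), \rho)$, there exists $P \in \mathcal{P}(\Theta)$ such that $\Delta_{\theta_k} \to P$. But $D$ (the
space of all Dirac measures) is closed by Corollary 3.3. Hence $P = \Delta_{\theta}$ for some $\theta \in \Theta$. Hence
$\theta_k \to \theta$. □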




Corollary 3.8. Assume $(\Theta, d)$ is separable. Then $(\Theta, d)$ is compact if and only if $(\mathcal{P}(\Theta), \rho)$ is
compact.

Proof. ($\Rightarrow$) If $(\Theta, d)$ is compact then every collection of measures on $\Theta$ (and specifically $\mathcal{P}(\Theta)$
itself) is tight and thus relatively compact by Theorem 3.6. Since $(\Theta, d)$ is compact, it is also
complete and so is $(\mathcal{P}(\Theta), \rho)$ (by the previous corollary) so that $(\mathcal{P}(\Theta), \rho)$ must be closed. Hence
relative compactness is compactness.

($\Leftarrow$) See the proof of the converse half of the previous corollary; given an arbitrary sequence
$\{\theta_k\}_{k=1}^{\infty}$, it must have a convergent subsequence. □

It is interesting to note that we may revisit the Riesz Representation Theorem (Theorem 2.1)
for an alternative proof of the direct half of the previous corollary. Given the compactness of
$(\Theta, d)$, by the Riesz Representation Theorem we have $P_M \to P$ if and only if $f^*_{P_M}(f) \to f^*_P(f)$
for all $f \in C_B(\Theta)$. Now, consider the ball

$$B = \left\{ f^* \in C_B(\Theta)^* \;\middle|\; \|f^*\| \le 1 \right\}.$$

This is the unit ball in $C_B(\Theta)^*$, which is compact in the weak* topology by Alaoglu’s Theorem
[30, pg. 237]. We may then observe that

$$\left\{ f^* \in B \;\middle|\; \|f^*\| = 1, \text{ and } f^* \text{ positive} \right\}$$

is homeomorphic to $(\mathcal{P}(\Theta), \rho)$. This set is also closed in $B$, and hence compact.

The compactness of the space $(\mathcal{P}(\Theta), \rho)$ given the compactness of $(\Theta, d)$ is of vital importance
for the theoretical framework to be discussed in the next sections. In effect, one need only show
that the cost functional $J_n(v, P)$ in (1.6) is a continuous function of $P$ in order to be guaranteed
the existence of a minimizer to the least squares estimation problem.

We need one final result which will be useful in establishing computational tools for the
parameter estimation problem.


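Theorem 3.9. Assume $(\Theta, d)$ is a separable, compact metric space. Let $\Theta_d = \{\theta_k\}_{k=1}^{\infty}$ be an
enumeration of the countable dense subset of $\Theta$. Take $\mathbb{Q} \subset \mathbb{R}$ to be the set of all rational numbers.
Define

$$\mathcal{P}_d(\Theta) = \left\{ P \in \mathcal{P}(\Theta) \;\middle|\; P = \sum_{k=1}^{M} p_k \Delta_{\theta_k},\ \theta_k \in \Theta_d,\ M \in \mathbb{N},\ p_k \in [0,1] \cap \mathbb{Q},\ \sum_{k=1}^{M} p_k = 1 \right\}.$$

(That is, $\mathcal{P}_d(\Theta)$ is the collection of all convex combinations of Dirac measures on $\Theta$ with atoms
$\theta_k \in \Theta_d$ and rational weights.) Then $\mathcal{P}_d(\Theta)$ is dense in $\mathcal{P}(\Theta)$, and thus $\mathcal{P}(\Theta)$ is separable.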

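Proof. $\mathcal{P}_d(\Theta)$ is obviously countable. Let $\epsilon > 0$ and let $P \in \mathcal{P}(\Theta)$ be arbitrary. We need to show
there exists $P_M \in \mathcal{P}_d(\Theta)$ such that $\rho(P_M, P) < \epsilon$. As before, we first note that for each $M \ge 1$,

$$\bigcup_{k=1}^{\infty} B_{1/M}(\theta_k) = \Theta$$

so that we may choose $n_0 = n_0(M)$ satisfying

$$P\left(\bigcup_{k=1}^{n_0} B_{1/M}(\theta_k)\right) \ge 1 - 1/M.$$

Define

$$A_1^M = B_{1/M}(\theta_1), \qquad A_k^M = B_{1/M}(\theta_k) \setminus \bigcup_{j=1}^{k-1} B_{1/M}(\theta_j), \quad k = 2, \ldots, n_0.$$

Then the sets $A_k^M$, $1 \le k \le n_0$, are disjoint and

$$\bigcup_{k=1}^{n} A_k^M = \bigcup_{k=1}^{n} B_{1/M}(\theta_k), \quad 1 \le n \le n_0
\quad\Longrightarrow\quad P\left(\bigcup_{k=1}^{n_0} A_k^M\right) \ge 1 - \frac{1}{M}.$$

Pick the values $p_k^M \in [0,1] \cap \mathbb{Q}$ such that

$$\sum_{k=1}^{n_0} p_k^M = 1, \qquad \sum_{k=1}^{n_0} \left| P(A_k^M) - p_k^M \right| < \frac{2}{M}.$$

(To do so, one may first freely choose values $p_k' \in [0,1] \cap \mathbb{Q}$ such that

$$\sum_{k=1}^{n_0} \left| P(A_k^M) - p_k' \right| < \frac{1}{2M}$$

and then set $p_k^M = p_k' / \sum_{k=1}^{n_0} p_k'$.) Now define

$$P_M = \sum_{k=1}^{n_0} p_k^M \Delta_{\theta_k}.$$

We must show that $\rho(P_M, P) \to 0$ as $M$ gets large. For any $f \in C_B(\Theta)$,

$$\begin{aligned}
\left| \int_\Theta f(\theta)\,dP_M(\theta) - \int_\Theta f(\theta)\,dP(\theta) \right|
&= \left| \sum_{k=1}^{n_0} p_k^M f(\theta_k) - \int_\Theta f(\theta)\,dP(\theta) \right| \\
&\le \left| \sum_{k=1}^{n_0} P(A_k^M) f(\theta_k) - \int_\Theta f(\theta)\,dP(\theta) \right| + \frac{2}{M} \sup_k |f(\theta_k)| \\
&\le \left| \int_\Theta \sum_{k=1}^{n_0} f(\theta_k)\chi_{A_k^M}(\theta)\,dP(\theta) - \int_\Theta f(\theta)\,dP(\theta) \right| + \frac{2}{M}\|f\|_\infty \\
&\le \left| \sum_{k=1}^{n_0} \int_\Theta \left( f(\theta_k)\chi_{A_k^M}(\theta) - f(\theta)\chi_{A_k^M}(\theta) \right) dP(\theta) \right|
 + \left| \int_\Theta f(\theta)\,\chi_{\left(\bigcup_{k=1}^{n_0} A_k^M\right)^c}(\theta)\,dP(\theta) \right| + \frac{2}{M}\|f\|_\infty \\
&\le \sum_{k=1}^{n_0} \sup_{\theta \in A_k^M} |f(\theta_k) - f(\theta)|\, P(A_k^M) + \frac{1}{M}\|f\|_\infty + \frac{2}{M}\|f\|_\infty.
\end{aligned}$$

(The function $\chi_A(\theta)$ is the indicator function on the set $A$.) Recall $A_k^M \subset B_{1/M}(\theta_k)$ by construction.
Thus $\theta \in A_k^M$ implies $d(\theta_k, \theta) < 1/M$ and for $M$ large enough, $|f(\theta_k) - f(\theta)| < \epsilon$ for all
$\theta \in A_k^M$ and for all $k$ (since $f \in C_B(\Theta)$ for $\Theta$ compact and thus $f$ is uniformly continuous).
Altogether we have

$$\left| \int_\Theta f(\theta)\,dP_M(\theta) - \int_\Theta f(\theta)\,dP(\theta) \right| \le \epsilon + \frac{\|f\|_\infty}{M} + \frac{2\|f\|_\infty}{M}$$

and the result is proved. □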

An alternative proof of the above result can be found in [4].
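The construction in the proof of Theorem 3.9 can be traced numerically. The following is a minimal sketch under assumptions that are not part of the original text: $\Theta = [0,1]$, the measure $P$ has distribution function $\theta^2$, the nodes are the grid points $k/M$, and the function names are hypothetical. It builds the disjoint cells $A_k^M$ from balls of radius $1/M$, rounds $P(A_k^M)$ to rational weights, renormalizes, and compares an integral against $P_M$ with the same integral against $P$.

```python
from fractions import Fraction

def discrete_approximation(cdf, M, max_den=10**6):
    """Return nodes theta_k and rational weights p_k with sum(p_k) == 1.

    Assumes Theta = [0, 1] and that `cdf` is the distribution function of P.
    """
    r = Fraction(1, M)                       # ball radius 1/M
    nodes = [Fraction(k, M) for k in range(M + 1)]
    rough, covered = [], Fraction(0)
    for theta in nodes:
        # A_k = B_{1/M}(theta_k) minus the balls already used, inside [0, 1]
        lo = max(theta - r, covered)
        hi = min(theta + r, Fraction(1))
        mass = max(cdf(float(hi)) - cdf(float(lo)), 0.0) if hi > lo else 0.0
        # rational value close to P(A_k)
        rough.append(Fraction(mass).limit_denominator(max_den))
        covered = max(covered, hi)
    total = sum(rough)
    weights = [p / total for p in rough]     # renormalise so the sum is exactly 1
    return nodes, weights

def integrate(f, nodes, weights):
    """Integral of f against the discrete measure sum_k p_k * Dirac(theta_k)."""
    return sum(float(p) * f(float(th)) for th, p in zip(nodes, weights))

cdf = lambda x: x * x                        # assumed P: density 2*theta on [0, 1]
M = 50
nodes, weights = discrete_approximation(cdf, M)
approx = integrate(lambda th: th, nodes, weights)
exact = 2.0 / 3.0                            # integral of theta * 2*theta over [0, 1]
print(abs(approx - exact))
```

For the Lipschitz test function $f(\theta) = \theta$ the error is controlled by the cell radius $1/M$ plus the rounding error in the weights, which mirrors the final estimate in the proof.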

4  Existence  and  Consistency  of  the  Estimator 

We  now  turn  our  attention  to  characterizing  the  least  squares  estimator  (1.5)  and  its  corre¬ 
sponding  estimate  (1.6).  In  the  present  section  we  ignore  any  computational  approximations 
and  establish  results  concerning  the  theoretical  existence  and  consistency  of  the  least  squares 
estimator  and  estimate,  regardless  of  our  ability  to  compute  them  (although  the  method  of  proof 
does  foreshadow  the  computational  approach  in  the  next  section). 




4.1  Existence  of  the  Estimator 


We begin by proving the existence of the estimator and of the estimate as measurable functions mapping a subset of $\mathbb{R}^n$
(that is, the data) into the space of probability measures on $\Theta$. We remark that the statement
of Theorem 4.1 concerns the estimate obtained from the data realizations $v \in \mathbb{R}^n$. This is
sufficient to establish the existence of the estimator as a measurable function as well, since
the random vector $V$ is by definition a measurable function from a probability triple into $\mathbb{R}^n$,
and the composition of measurable functions is measurable.

Theorem 4.1. Define the function $J_n : \mathbb{R}^n \times \mathcal{P}(\Theta) \to \mathbb{R}$ according to Equation (1.6). Assume
$(\Theta, d)$ is separable and compact and take the space of probability measures $\mathcal{P}(\Theta)$ with the Prohorov
metric $\rho$. Assume further that $J_n(\cdot, P)$ is a measurable function from $\mathbb{R}^n \to \mathbb{R}$ for each $P \in \mathcal{P}(\Theta)$,
and that $J_n(v, \cdot) : \mathcal{P}(\Theta) \to \mathbb{R}$ is continuous for each $v \in \mathbb{R}^n$. Then there exists a measurable
function $P_n : \mathbb{R}^n \to \mathcal{P}(\Theta)$ such that

$$J(v, P_n(v)) = \inf_{P \in \mathcal{P}(\Theta)} J(v, P).$$

Proof. Let $\Theta_d = \{\theta_k\}_{k=1}^{\infty}$ be an enumeration of the countable dense subset of $\Theta$ as used in
Theorem 3.9. For each $M \ge 1$, define

$$\mathcal{P}_M(\Theta) = \left\{ P \in \mathcal{P}_d(\Theta) \;\middle|\; P = \sum_{k=1}^{M} p_k \Delta_{\theta_k},\ \theta_k \in \{\theta_k\}_{k=1}^{M} \right\} \subset \mathcal{P}_d(\Theta). \tag{4.1}$$

(That is, $\mathcal{P}_M$ is the set of all discrete measures consisting of a convex combination of $M$ Dirac
measures with atoms in $\{\theta_k\}_{k=1}^{M}$ weighted with rational coefficients.) Thus $\mathcal{P}_M$ is countable. Let
$\{P_j^M\}_{j=1}^{\infty}$ be an enumeration of the elements of $\mathcal{P}_M$. (We remark that, because the $M$ nodes
$\theta_k$ are fixed in advance, the space $\mathcal{P}_M$ can be analogously considered as a subset of $\mathbb{R}^M$, a fact
which will be exploited in some of the notation below.) Finally, define $\mathcal{P}_M^J = \{P_j^M\}_{j=1}^{J}$, the first
$J$ enumerated elements of $\mathcal{P}_M$.

Fix $J \ge 1$. Define the function $P_J^M(v)$ implicitly as

$$J(v, P_J^M(v)) = \min_{P \in \mathcal{P}_M^J} J(v, P).$$

Such a function must exist because the minimum is being taken over a finite number of elements
from a point set; if the minimum occurs at multiple elements of $\mathcal{P}_M^J$, we may arbitrarily choose
the element which comes first in the enumeration so that the function $P_J^M(v)$ is well-defined.
First, we show that $P_J^M(v)$ is measurable.

Let $F \in \Sigma_{\mathcal{P}_M^J}$. (Thus $F$ is a finite point set.) We must show that the set $B$ defined as

$$B = \left\{ v \;\middle|\; P_J^M(v) \in F \right\}$$

is contained within the measurable sets $\Sigma_{\mathbb{R}^n}$ in $\mathbb{R}^n$. Since $F$ is a finite point set, we can define for
each $P_j^M \in F$ the sets

$$\begin{aligned}
B_j &= \left\{ v \;\middle|\; P_J^M(v) = P_j^M \right\} \\
&= \left\{ v \;\middle|\; J(v, P_j^M) = \min_{P \in \mathcal{P}_M^J} J(v, P) \right\} \\
&= \left\{ v \;\middle|\; J(v, P_j^M) = \min_{1 \le i \le J} J(v, P_i^M) \right\}.
\end{aligned}$$

By assumption, the functions $J(v, P_j^M)$ are measurable from $\mathbb{R}^n$ into $\mathbb{R}$ for all $P_j^M$, $j \ge 1$. The
minimum over a finite set of functions is also measurable, as is the test for equality. Hence
$B_j \in \Sigma_{\mathbb{R}^n}$. Finally, $B = \bigcup B_j$, the union being over the finite number of sets $B_j$, hence $B \in \Sigma_{\mathbb{R}^n}$
and the function $P_J^M(v)$ is measurable.

As mentioned previously, we can identify the function $P_J^M(v)$ with $[0,1]^M \cap \mathbb{Q}^M$ via the map
$P_J^M(v) \mapsto (p_1^{M,J}(v), \ldots, p_M^{M,J}(v))$. Let $p_1^{M,J}$ be the first component of the vector representation for
$P_J^M(v)$. Now consider the sequence $\{p_1^{M,J}\}_{J=1}^{\infty}$. Define

$$p_1^M(v) = \liminf_{J \to \infty} p_1^{M,J}(v).$$

Since each $p_1^{M,J}(v)$ is a measurable function, so is $p_1^M(v)$. Also, since the space $[0,1]^M$ is compact,
there must exist a convergent subsequence $P_{J_i}^M$ of (the vector representation of) $P_J^M$ to some
vector $(p_1^M(v), p_2^M(v), \ldots, p_M^M(v))$, which can be identified with a measure $P_M$. Now

$$\begin{aligned}
\inf_{[0,1]^{M-1} \cap \mathbb{Q}^{M-1}} J_n\left(v, (p_1^M, p_2, \ldots, p_M)\right) &\le J_n(v, P_M) \\
&= \lim_{i \to \infty} J_n(v, P_{J_i}^M) \\
&= \lim_{i \to \infty} \inf_{P \in \mathcal{P}_M^{J_i}} J_n(v, P) \\
&= \inf_{P \in \mathcal{P}_M} J_n(v, P).
\end{aligned}$$

The first equality comes from the definition of $P_M$ and the continuity of the function $J$; the
second equality comes from the definition of $P_M$ as the limit of the probability measures $P_{J_i}^M$;
the final equality arises from the density of $\{P_j^M\}$ in $\mathcal{P}_M$.

Now, define (with some abuse of notation)

$$J_n^{(1,M)}(v, P) = J_n\left(v, (p_1^M, p_2, \ldots, p_M)\right).$$

Applying the same arguments above inductively on $J_n^{(1,M)}$, we obtain a set of measurable functions
$p_1^M(v), \ldots, p_M^M(v)$ such that

$$J_n\left(v, (p_1^M, \ldots, p_M^M)\right) = \inf_{P \in \mathcal{P}_M} J_n(v, P)$$

and we have proven the existence of a measurable function $P_M \in \mathcal{P}_M$ mapping $\mathbb{R}^n \to \mathcal{P}(\Theta)$
which minimizes the cost functional $J_n$ over $\mathcal{P}_M$. We conclude the proof by noting that

$$J_n(v, P(v)) = \inf_{P \in \mathcal{P}(\Theta)} J_n(v, P) = \lim_{M \to \infty} \inf_{P \in \mathcal{P}_M(\Theta)} J_n(v, P) = \lim_{M \to \infty} J_n(v, P_M).$$

As the final term in the equation above is the composition of measurable functions, it is measurable,
and thus $J(v, P(v))$ must be measurable, so that $P(v)$ must be measurable as well. □
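The first step of the proof, minimizing over the first finitely many enumerated elements of $\mathcal{P}_M$ and breaking ties by order of enumeration, can be illustrated with a short sketch. Everything specific here is an assumption for illustration only: the three nodes, the observation times, the exponential individual model, and the restriction to weights with a common denominator $q$.

```python
from fractions import Fraction
from itertools import product
from math import exp

NODES = [0.5, 1.0, 2.0]            # hypothetical fixed nodes theta_k
TIMES = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical observation times t_j

def model(weights, t):
    """Aggregate output v(t; P) for P = sum_k p_k * Dirac(theta_k)."""
    return sum(float(p) * exp(-theta * t) for p, theta in zip(weights, NODES))

def cost(data, weights):
    """Ordinary least squares cost J(v, P)."""
    return sum((v_j - model(weights, t)) ** 2 for v_j, t in zip(data, TIMES))

def enumerate_weights(q):
    """All weight vectors with entries in {0, 1/q, ..., 1} that sum to one."""
    for counts in product(range(q + 1), repeat=len(NODES)):
        if sum(counts) == q:
            yield tuple(Fraction(c, q) for c in counts)

def first_minimizer(data, q):
    # min() returns the first minimal element it encounters, which matches
    # the "first in the enumeration" tie-breaking rule used in the proof.
    return min(enumerate_weights(q), key=lambda w: cost(data, w))

true_weights = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))
data = [model(true_weights, t) for t in TIMES]   # noise-free synthetic data
best = first_minimizer(data, q=4)
print(best, cost(data, best))
```

Because the synthetic data are noise-free and the generating weights belong to the enumerated set, the selected element coincides with the generating weights; with noisy data the selection is still well defined, which is all the measurability argument requires.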

4.2  Consistency  of  the  Estimator 

Theorem 4.1 shows that for any fixed $n$ the estimator and the corresponding estimate exist
as measurable functions mapping the data into the space of probability measures. An obvious
question, then, is what the resulting measures represent. Since the estimate is just a realization of
the estimator (given a specific set of data), we focus on characterization of the properties of the estimator
$P_n$. Given the problem formulation (1.5) and the statistical model (1.3), one would certainly
hope that the estimator provides some information regarding the underlying ‘true’ distribution
$P_0$. In particular, we would hope that $P_n \to P_0$ in some appropriate sense. If this is the case,
then the estimator is said to be consistent. Of course, the estimator itself is a random variable,
and thus this convergence must be discussed in terms of probability. With this in mind, we
consider the following set of assumptions.

(A1) For any fixed $n$, the error random variables $\{\mathcal{E}_j\}_{j=1}^{n}$ are independent and identically
distributed, defined on some probability triple $(\Omega, \Sigma_\Omega, P_\Omega)$.

(A2) For $\mathcal{E} = (\mathcal{E}_1, \ldots, \mathcal{E}_n)$, $E[\mathcal{E}] = 0$ and $\mathrm{Var}[\mathcal{E}] = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix.

(A3) $(\Theta, d)$ is a separable, compact metric space; the space $\mathcal{P}(\Theta)$ is taken with the Prohorov
metric $\rho$.

(A4) For all $j$, $1 \le j \le n$, $t_j \in T$ for some compact space $T$.

(A5) The model function $v \in C(\mathcal{P}(\Theta), C(T))$.

(A6) There exists a measure $\mu$ on $T$ such that

$$\frac{1}{n} \sum_{j=1}^{n} g(t_j) = \int_T g(t)\,d\mu_n(t) \to \int_T g(t)\,d\mu(t)$$

for all $g \in C(T)$.

(A7) The functional

$$J_0(P) = \sigma^2 + \int_T \left( v(t; P_0) - v(t; P) \right)^2 d\mu(t)$$

is uniquely minimized at $P_0 \in \mathcal{P}(\Theta)$.

Assumption (A1) establishes the probability triple on which the error random variables $\mathcal{E}_j$ are
assumed to be defined. As we will see, this probability triple will permit us to make probabilistic
statements regarding the consistency of the estimator $P_n$. These assumptions as well as the two
theorems below follow closely the theoretical results of [15] which establish the consistency of the
ordinary least squares estimator for a traditional nonlinear least squares problem. The key idea
is to first argue that the scaled functions $\frac{1}{n} J_n(V; P)$ converge to $J_0$ as $n$ increases; then the minimizer $P_n$
of $J_n$ should converge to the unique minimizer $P_0$ of $J_0$ [1].

Because the functions $J_n$ are functions of the vector $V$, which itself depends on the random
variables $\mathcal{E}_j$, these functions are themselves random variables, as are the estimators $P_n$. Though
we have generally refrained from doing so up to this point, it will occasionally be convenient
to evaluate these functions at points in the underlying probability triple. Thus we may write
$J_n(V; P)(\omega)$, $\mathcal{E}_j(\omega)$, etc., whenever the particular value of $\omega$ is of interest.

Theorem 4.2. Under assumptions (A1)-(A7), there exists a set $A \in \Sigma_\Omega$ with $P_\Omega(A) = 1$ such
that for all $\omega \in A$,

$$\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$$

as $n \to \infty$ and for each $P \in \mathcal{P}(\Theta)$. Moreover, the convergence is uniform on $\mathcal{P}(\Theta)$.

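Proof. As in [15], the proof will proceed in three parts. First, for any fixed element $P \in \mathcal{P}(\Theta)$,
a set $A_P$ is constructed with $P_\Omega(A_P) = 1$ such that the convergence statement holds. The sets
$A_P$ are then used to construct a set $A$ as described. Finally, the uniform convergence is shown.
Let $P \in \mathcal{P}(\Theta)$ be fixed. We may rewrite

$$\begin{aligned}
\frac{1}{n} J_n(V; P) &= \frac{1}{n} \sum_{j=1}^{n} \left( V_j - v(t_j; P) \right)^2 \\
&= \frac{1}{n} \sum_{j=1}^{n} \left( \mathcal{E}_j + v(t_j; P_0) - v(t_j; P) \right)^2 \\
&= \frac{1}{n} \sum_{j=1}^{n} \mathcal{E}_j^2 + \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right) \mathcal{E}_j + \frac{1}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right)^2.
\end{aligned}$$

We consider the three terms on the right. For the first term, define

$$B_1 = \left\{ \omega \in \Omega \;\middle|\; \frac{1}{n} \sum_{j=1}^{n} \mathcal{E}_j^2 \to \sigma^2 \right\}.$$

By the Strong Law of Large Numbers, $P_\Omega(B_1) = 1$. For the third term, observe that

$$\frac{1}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right)^2 \to \int_T \left( v(t; P_0) - v(t; P) \right)^2 d\mu(t) = J_0(P) - \sigma^2$$

by assumption (A6) and the continuity of $v(t; \cdot)$. (Note also that this convergence is independent
of $\omega \in \Omega$.) For the second term, define

$$\tilde{\mathcal{E}}_j = \left( v(t_j; P_0) - v(t_j; P) \right) \mathcal{E}_j.$$

Then

$$\begin{aligned}
E[\tilde{\mathcal{E}}_j] &= 0 \\
\mathrm{Var}[\tilde{\mathcal{E}}_j] &= \sigma^2 \left( v(t_j; P_0) - v(t_j; P) \right)^2 \\
&\le \sigma^2 \sup_{t \in T} \left( v(t; P_0) - v(t; P) \right)^2 \\
&\le M_P
\end{aligned}$$

where the final inequality follows from the continuity of $v$ and the compactness of $T$. Hence we
have

$$\sum_{j=1}^{\infty} \frac{\mathrm{Var}[\tilde{\mathcal{E}}_j]}{j^2} \le M_P \sum_{j=1}^{\infty} \frac{1}{j^2} < \infty$$

and therefore the set $B_P$ defined by

$$B_P = \left\{ \omega \in \Omega \;\middle|\; \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right) \mathcal{E}_j \to 0 \right\}$$

satisfies $P_\Omega(B_P) = 1$ by Kolmogorov’s Law of Large Numbers. Finally, we may define $A_P =
B_1 \cap B_P$. Then $P_\Omega(A_P) = 1$ and $\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$ for each $\omega \in A_P$, which completes the
first part of the proof.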

For the second part of the proof, we must find a set $A$ with $P_\Omega(A) = 1$ such that $\frac{1}{n} J_n(V; P)(\omega) \to
J_0(P)$ for each $\omega \in A$ and for all $P \in \mathcal{P}(\Theta)$. Naively, we desire $A = \bigcap A_P$, but this intersection
is (in general) uncountable. Rather, we construct the set $A$ using the dense countable subset of
$\mathcal{P}(\Theta)$ (Theorem 3.9). Define

$$A_1 = \left\{ \omega \in \Omega \;\middle|\; \frac{1}{n} \sum_{j=1}^{n} |\mathcal{E}_j| \to E[|\mathcal{E}_1|] \right\}.$$

Again by the Strong Law of Large Numbers, $P_\Omega(A_1) = 1$. Now define the set $\mathcal{P}_d(\Theta)$ as before
and set

$$A = A_1 \cap \bigcap_{P \in \mathcal{P}_d(\Theta)} A_P.$$

Since the intersection is taken over a countable number of sets, each having probability one (with
respect to $P_\Omega$), $P_\Omega(A) = 1$. To complete the second part of the proof, we must show that $A \subset A_P$
for all $P \in \mathcal{P}(\Theta)$ (and not merely for all $P \in \mathcal{P}_d(\Theta)$, which holds by the definition of $A$). If this
is the case, then $\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$ for all $\omega$ in $A$ and for all $P \in \mathcal{P}(\Theta)$.

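Consider any $P \in \mathcal{P}(\Theta)$ and take $\omega \in A$, $\epsilon > 0$. Since $\omega \in A$, $\omega \in A_1$ and we may choose $n_1$
such that for all $n \ge n_1$,

$$\frac{1}{n} \sum_{j=1}^{n} |\mathcal{E}_j| < 1 + E[|\mathcal{E}_1|].$$

By the continuity of $v$ and the density of $\mathcal{P}_d(\Theta)$ in $\mathcal{P}(\Theta)$, we may choose $P_M \in \mathcal{P}_d(\Theta)$ such that

$$\sup_{t \in T} |v(t; P) - v(t; P_M)| < \frac{\epsilon}{4\left(E[|\mathcal{E}_1|] + 1\right)}.$$

Finally, $\omega \in A$ implies $\omega \in A_{P_M}$ which in turn implies $\omega \in B_{P_M}$. Thus we may choose $n_2$ such
that for all $n \ge n_2$,

$$\left| \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P_M) \right) \mathcal{E}_j \right| < \frac{\epsilon}{2}.$$

Then for $n \ge \max\{n_1, n_2\}$,

$$\begin{aligned}
\left| \frac{1}{n} J_n(V; P) - J_0(P) \right|
&\le \left| \frac{1}{n} \sum_{j=1}^{n} \mathcal{E}_j^2 - \sigma^2 \right|
+ \left| \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right) \mathcal{E}_j \right| \\
&\quad + \left| \frac{1}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right)^2 - \int_T \left( v(t; P_0) - v(t; P) \right)^2 d\mu(t) \right|.
\end{aligned}$$

The first term goes to zero since $\omega \in A$ implies $\omega \in B_1$. The final term goes to zero by
assumptions (A5) and (A6). For the second term,

$$\begin{aligned}
\left| \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P) \right) \mathcal{E}_j \right|
&\le \left| \frac{2}{n} \sum_{j=1}^{n} \left( v(t_j; P_0) - v(t_j; P_M) \right) \mathcal{E}_j \right|
+ \frac{2}{n} \sum_{j=1}^{n} |v(t_j; P_M) - v(t_j; P)| \cdot |\mathcal{E}_j| \\
&< \frac{\epsilon}{2} + 2 \left( \sup_{t \in T} |v(t; P_M) - v(t; P)| \right) \left( \frac{1}{n} \sum_{j=1}^{n} |\mathcal{E}_j| \right) \\
&< \frac{\epsilon}{2} + 2 \cdot \frac{\epsilon}{4\left(E[|\mathcal{E}_1|] + 1\right)} \left( E[|\mathcal{E}_1|] + 1 \right) \\
&= \epsilon.
\end{aligned}$$

Thus $\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$ and thus $\omega \in A_P$. Thus $A \subset A_P$ for all $P \in \mathcal{P}(\Theta)$ and the second
part of the proof is complete.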

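Finally, we must show the convergence is uniform on $\mathcal{P}(\Theta)$ for $\omega \in A$. To do so we will
show that the sequence of functions $\frac{1}{n} J_n(V; P)(\omega)$ is equicontinuous (viewed as functions of $P$)
and then use the Arzela-Ascoli Theorem. For fixed $\omega \in A$, let $\epsilon > 0$. Take $P \in \mathcal{P}(\Theta)$. By the
continuity of $v$ (A5) and compactness of $T$ (A4), there exists a $\delta > 0$ such that

$$\sup_{t \in T} \left| v(t; P) - v(t; \tilde{P}) \right| < \frac{\epsilon}{6} \min\left\{ \frac{1}{E[|\mathcal{E}_1|] + 1},\ \frac{1}{\sup_{t \in T} |v(t; P_0)|} \right\},
\qquad
\sup_{t \in T} \left| v(t; P)^2 - v(t; \tilde{P})^2 \right| < \frac{\epsilon}{3},$$

for all $\tilde{P} \in B_\delta(P)$. Since $\omega \in A$, $\omega \in A_1$ and we can choose $N$ such that $n \ge N$ implies

$$\frac{1}{n} \sum_{j=1}^{n} |\mathcal{E}_j| < E[|\mathcal{E}_1|] + 1.$$

Then for $n \ge N$ and for all $\tilde{P} \in B_\delta(P)$,

$$\begin{aligned}
\left| \frac{1}{n} J_n(V; P) - \frac{1}{n} J_n(V; \tilde{P}) \right|
&= \left| \frac{1}{n} \sum_{j=1}^{n} \left( \mathcal{E}_j + v(t_j; P_0) - v(t_j; P) \right)^2 - \frac{1}{n} \sum_{j=1}^{n} \left( \mathcal{E}_j + v(t_j; P_0) - v(t_j; \tilde{P}) \right)^2 \right| \\
&= \left| \frac{1}{n} \sum_{j=1}^{n} \left( 2\mathcal{E}_j + 2v(t_j; P_0) - v(t_j; P) - v(t_j; \tilde{P}) \right) \left( v(t_j; \tilde{P}) - v(t_j; P) \right) \right| \\
&\le \left| \frac{2}{n} \sum_{j=1}^{n} \left( \mathcal{E}_j + v(t_j; P_0) \right) \left( v(t_j; \tilde{P}) - v(t_j; P) \right) \right|
+ \left| \frac{1}{n} \sum_{j=1}^{n} \left( v(t_j; P)^2 - v(t_j; \tilde{P})^2 \right) \right| \\
&\le \left( \frac{2}{n} \sum_{j=1}^{n} |\mathcal{E}_j| \right) \left( \sup_{t \in T} |v(t; P) - v(t; \tilde{P})| \right)
+ 2 \left( \sup_{t \in T} |v(t; P_0)| \right) \left( \sup_{t \in T} |v(t; P) - v(t; \tilde{P})| \right) \\
&\quad + \sup_{t \in T} \left| v(t; P)^2 - v(t; \tilde{P})^2 \right| \\
&< \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon.
\end{aligned}$$

Thus the sequence of functions $\frac{1}{n} J_n(V; P)(\omega)$ is equicontinuous for each $\omega \in A$ and by the Arzela-
Ascoli Theorem, $\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$ uniformly on compact subsets of $\mathcal{P}(\Theta)$, and hence on
$\mathcal{P}(\Theta)$ itself. □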
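The convergence $\frac{1}{n} J_n(V; P) \to J_0(P)$ in Theorem 4.2 can be observed in simulation. The sketch below rests on assumptions that are not in the original text: $T = [0,1]$ with equally spaced sampling times (so that the limiting measure $\mu$ is Lebesgue measure), Gaussian errors, a discrete mixture of exponentials as the aggregate model, and hypothetical function names. It compares the scaled cost for increasing $n$ with $J_0(P) = \sigma^2 + \int_T (v(t;P_0) - v(t;P))^2\,d\mu(t)$ computed by a midpoint rule.

```python
import random
from math import exp

NODES = [0.5, 1.0, 2.0]                 # hypothetical nodes theta_k

def v(t, weights):
    """Aggregate model v(t; P) for a discrete P with the given weights."""
    return sum(p * exp(-theta * t) for p, theta in zip(weights, NODES))

def scaled_cost(n, p0, p, sigma, rng):
    """(1/n) J_n(V; P) with V_j = v(t_j; P0) + E_j and midpoint times on [0, 1]."""
    total = 0.0
    for j in range(n):
        t_j = (j + 0.5) / n             # sampling measure tends to Lebesgue on [0, 1]
        V_j = v(t_j, p0) + rng.gauss(0.0, sigma)
        total += (V_j - v(t_j, p)) ** 2
    return total / n

def limit_cost(p0, p, sigma, grid=50_000):
    """J_0(P) = sigma^2 + integral over T of (v(t; P0) - v(t; P))^2, midpoint rule."""
    integral = sum((v((i + 0.5) / grid, p0) - v((i + 0.5) / grid, p)) ** 2
                   for i in range(grid)) / grid
    return sigma ** 2 + integral

rng = random.Random(0)                  # fixed seed for repeatability
p0, p, sigma = [0.25, 0.5, 0.25], [0.6, 0.3, 0.1], 0.1
j0 = limit_cost(p0, p, sigma)
gaps = [abs(scaled_cost(n, p0, p, sigma, rng) - j0) for n in (100, 1_000, 20_000)]
print(gaps)
```

The three contributions in the proof are visible here: the squared errors average to $\sigma^2$, the cross term averages to zero, and the deterministic term is a Riemann sum for the integral in $J_0$.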


Theorem 4.3. Under assumptions (A1)-(A7), the estimators $P_n \to P_0$ as $n \to \infty$ with probability
1. That is,

$$P_\Omega\left( \left\{ \omega \;\middle|\; P_n(V)(\omega) \to P_0 \right\} \right) = 1.$$

Proof. Take the set $A$ as in the previous theorem and fix $\omega \in A$. Then by the previous theorem,
$\frac{1}{n} J_n(V; P)(\omega) \to J_0(P)$ for all $P \in \mathcal{P}(\Theta)$. Let $\delta > 0$ be arbitrary and define $O = B_\delta(P_0)$. Then $O$
is open in $\mathcal{P}(\Theta)$ (in the subspace topology) and $O^c$ is compact (again, in the subspace topology).
Since $P_0$ is the unique minimizer of $J_0(P)$ by assumption (A7), there exists $\epsilon > 0$ such that

$$J_0(P) - J_0(P_0) \ge \epsilon$$

for all $P \in O^c$. By the previous theorem, there exists $n_0$ such that for $n \ge n_0$,

$$\left| \frac{1}{n} J_n(V; P)(\omega) - J_0(P) \right| < \frac{\epsilon}{4}$$

for all $P \in \mathcal{P}(\Theta)$. Then for $n \ge n_0$ and $P \in O^c$,

$$\begin{aligned}
\frac{1}{n} \left[ J_n(V; P)(\omega) - J_n(V; P_0)(\omega) \right]
&= \frac{1}{n} J_n(V; P)(\omega) - J_0(P) + J_0(P) - J_0(P_0) + J_0(P_0) - \frac{1}{n} J_n(V; P_0)(\omega) \\
&> -\frac{\epsilon}{4} + \epsilon - \frac{\epsilon}{4} \\
&> 0.
\end{aligned}$$

But $J_n(V; P_n)(\omega) \le J_n(V; P_0)(\omega)$ by definition of $P_n$. Hence we must have $P_n \in O = B_\delta(P_0)$ for
all $n \ge n_0$, which implies $P_n(\omega) \to P_0$ since $\delta > 0$ was arbitrary. □

Theorem 4.3 establishes the consistency of the estimator (1.5). Given a set of data $v$, it follows
that the estimate corresponding to the estimator $P_n$ will converge to the true distribution
$P_0$ under the stated assumptions. We remark that these assumptions are not overly restrictive
(compare [15, 20, 28]) though some of the assumptions may be difficult to verify in practice.
Assumptions (A3)-(A5) are mathematical in nature and may be verified directly for each specific
problem. Assumptions (A1) and (A2) describe the error process which is assumed to generate the
collected data. While it is unlikely that one will be able to prove a priori that the error process
satisfies these assumptions, posterior analysis such as residual plots [24, Ch. 3] can be used to
investigate the appropriateness of the assumptions of the statistical model. Assumption (A6)
reflects the manner in which data is sampled and, together with Assumption (A7), constitutes
an identifiability condition for the model. The limiting sampling distribution function $\mu$ may be
known if the experimenter has complete control over the values $t_j$ of the independent variables
(e.g., if the $t_j$ are measurement times) but this is not always the case.


5  Computational  Convergence 

To this point, the analysis has focused on the properties of the estimators $P_n$ and the resulting
estimates. However, it is generally not possible to solve the optimization problems (1.5)
or (1.6) for the estimator or the estimate as a function of $V$ or $v$. As a result, approximate (generally numerical)
methods must be used in order to solve (1.7) and obtain an approximate estimate $P^N_{n,M}$. We
must ascertain, then, how the approximate estimate $P^N_{n,M}$ relates to the exact estimate (for
any fixed value of $n$). These results are outlined in [21] and are included again here with proof.

Theorem 5.1. Let $(\Theta, d)$ be a compact, separable metric space and consider the space $(\mathcal{P}(\Theta), \rho)$
of probability measures on $\Theta$ with the Prohorov metric, as before. Let $\mathcal{P}_M(\Theta)$ be as defined in
(4.1). Assume

1. the map $P \mapsto J_n^N(v, P)$ is continuous for all $n$, $N$;

2. for any sequence of probability measures $P_k \to P$ in $\mathcal{P}(\Theta)$, $v^N(t; P_k) \to v(t; P)$ as $N, k \to
\infty$;

3. $v(t; P)$ is uniformly bounded for all $t$, $P$.

Then there exist minimizers $P^N_{n,M}$ satisfying (1.7). Moreover, for fixed $n$, there exists a subsequence
(as $M, N \to \infty$) of the approximate estimates $P^N_{n,M}$ which converges to some (possibly
non-unique) $P_n^*$ which satisfies (1.6).

Proof. For any fixed $n$, the existence of the minimizers $P^N_{n,M}$ follows from the compactness of the
space $(\mathcal{P}(\Theta), \rho)$ (Corollary 3.8) and the continuity of the map $P \mapsto J_n^N(v, P)$ (Assumption 1). By
definition, these minimizers satisfy

$$J_n^N(v, P^N_{n,M}) \le J_n^N(v, P_M) \tag{5.1}$$

for all $P_M \in \mathcal{P}_M(\Theta)$ and for each $n$, $N$.

Next, we show an auxiliary result. Consider any sequence $P_k \to P$ in $\mathcal{P}(\Theta)$ (see Assumption
2). Then

$$\begin{aligned}
\left| J_n^N(v, P_k) - J_n(v, P) \right|
&= \left| \sum_j \left( v_j - v^N(t_j; P_k) \right)^2 - \sum_j \left( v_j - v(t_j; P) \right)^2 \right| \\
&= \left| \sum_j \left( 2v_j - v^N(t_j; P_k) - v(t_j; P) \right) \left( v(t_j; P) - v^N(t_j; P_k) \right) \right| \\
&\le M \sum_j \left| v(t_j; P) - v^N(t_j; P_k) \right| \to 0,
\end{aligned}$$

where we have used the uniform boundedness of $v(t; P)$ (Assumption 3) as well as Assumption
2.

Now, we return to (5.1). Since $\mathcal{P}(\Theta)$ is compact there must exist (possibly after reindexing)
a limit $P_n^* = \lim P^N_{n,M}$. Next consider any $P \in \mathcal{P}(\Theta)$. By Theorem 3.9, it is possible to construct
a sequence of measures $P_M \in \mathcal{P}_M(\Theta) \subset \mathcal{P}(\Theta)$ so that $P_M \to P$ in $\mathcal{P}(\Theta)$. Hence, taking limits
in (5.1) as $M$ and $N$ go to infinity (where we tacitly assume the indices have been renumbered
according to the convergent subsequence), we have

$$J_n(v, P_n^*) \le J_n(v, P) \quad \text{for all } P \in \mathcal{P}(\Theta),$$

and we see that $P_n^*$ satisfies (1.6). □

This theorem provides a set of conditions under which a subsequence of approximate estimates
$P^N_{n,M}$ converges to the estimate $P_n^*$ of interest. This estimate is itself a realization (for a particular
data set) of the estimator $P_n$ which has been shown to exist and to be consistent, so that
$P_n \to P_0$ with probability one. Thus we have some reasonable assurance that a computed
approximate estimate $P^N_{n,M}$ reflects the true distribution $P_0$. The assumptions of Theorem 5.1 are
not restrictive. In typical problems (and, indeed, in the assumptions of other theorems appearing
in this document) it is assumed that the parameter space $\Theta$ as well as the independent variable
space $T$ are compact (see, e.g., Section 4). In such a case, Assumptions 1 and 3 above are satisfied
if the individual model solutions $y(t; \theta)$ are continuous on $T \times \Theta$. Assumption 2 is then simply
a condition on the convergence of the numerical procedure used in obtaining model solutions.

Significantly, the Prohorov Metric Framework is computationally constructive. In practice,
one does not construct a sequence of estimates for increasing values of $M$ and $N$; rather, one fixes
the values of $M$ and $N$ to be sufficiently large to attain a desired level of accuracy. By Theorem
3.9, we need only to have some enumeration of the elements of $\mathcal{P}_M(\Theta)$ in order to compute an
approximate estimate $P^N_{n,M}$. (We will not consider the choice of $N$, as this will depend upon the
numerical framework by which approximate model solutions $v^N(t; P)$ are obtained.) Practically,
this is accomplished by selecting $M$ nodes in $\Theta$, $\{\theta_k\}_{k=1}^{M}$. The optimization problem (1.7) is
then reduced to a standard constrained estimation problem over Euclidean $M$-space in which
one determines the values of the weights $p_k^M$ corresponding to each node. Thus,


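$$\begin{aligned}
P^N_{n,M} &= \arg\min_{\mathcal{P}_M(\Theta)} \sum_{j=1}^{n} \left( v_j - v(t_j; P) \right)^2 \\
&= \arg\min_{\mathcal{P}_M(\Theta)} \sum_{j=1}^{n} \left( v_j - \int_\Theta C y(t_j; \theta)\,dP(\theta) \right)^2 \\
&= \arg\min_{\tilde{\mathbb{R}}^M} \sum_{j=1}^{n} \left( v_j - \sum_{k=1}^{M} C y(t_j; \theta_k)\, p_k^M \right)^2,
\end{aligned}$$

where in the final line we seek the weights $p^M = (p_1^M, \ldots, p_M^M)^T \in \tilde{\mathbb{R}}^M = \{ p^M \mid p_k^M \in \mathbb{R}^+, \sum_{k=1}^{M} p_k^M =
1 \}$. These are sufficient to characterize the approximating discrete estimate $P^N_{n,M}$ since the nodes
are assumed to be fixed in advance. Moreover, define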


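$$\begin{aligned}
H_{kl} &= 2 \sum_j \left( C y(t_j; \theta_k) \right) \left( C y(t_j; \theta_l) \right) \\
f_k &= -2 \sum_j v_j \left( C y(t_j; \theta_k) \right) \\
c &= \sum_j v_j^2.
\end{aligned}$$

Then one can equivalently compute [11]

$$P^N_{n,M} = \arg\min_{\tilde{\mathbb{R}}^M} \left( \frac{1}{2} (p^M)^T H p^M + f^T p^M + c \right). \tag{5.2}$$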

From this reformulation, it is clear that the approximate problem (1.7) has a unique solution if
$H$ is positive definite. If the individual mathematical model (1.2) is independent of $P$,² then the
matrix $H$ and the vector $f$ can be precomputed. Then one can rapidly (and exactly) compute
the gradient and Hessian of the objective function in a numerical optimization routine. As $M$
grows large, the quadratic optimization problem (5.2) becomes poorly conditioned [11]. Thus
there is a trade-off: $M$ must be chosen sufficiently large so that the computational approximation
is accurate, but not so large that ill-conditioning leads to large numerical errors. The efficient
choice of $M$ as well as the choice of the nodes $\{\theta_k\}_{k=1}^{M}$ is an open research problem.

² This independence of the individual model on the population distribution is strongly suggested by our choice
of notation for the individual solutions, $y(t; \theta)$. In many problems of interest, this is a perfectly reasonable
assumption. For instance, in a size-structured biological model [9, 11, 13, 16, 17], the individual rate of growth
may vary across the population, but the rate of growth of an individual is unaffected by the rates of growth
of its neighbors. It is possible, however, that the individual mathematical model may depend upon the population
distribution, $y(t; \theta, P)$. For instance, in a size-structured population model, fast-growing individuals
may out-compete their slower growing neighbors for limited resources. Such examples also arise in models of
electromagnetic polarization and deformations of viscoelastic materials. See [2, Sec. 14.1.2] for a more complete
discussion.

It should be acknowledged that the uniqueness of the computational problem (i.e., when
$H$ is positive definite) is not sufficient to ensure the uniqueness of the limiting estimate $P_n^*$ in
Theorem 5.1 (as there could be multiple convergent subsequences). However, if $J_n(v; P)$ is
uniquely minimized, then every subsequence of $P^N_{n,M}$ which converges as $N$, $M$ grow large must
converge to that unique minimizer. Moreover, under assumptions (A1)-(A7) in Section 4, it has
been shown that $\frac{1}{n} J_n(V; P) \to J_0(P)$ (as $n$ grows large) with probability one, and the function
$J_0(P)$ is assumed to be uniquely minimized by $P_0$.
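The reduction of (1.7) to the quadratic problem (5.2) can be checked on a small example. The following sketch makes several assumptions that are not in the original text: the observation operator $C$ is the identity, the individual model is $y(t;\theta) = e^{-\theta t}$, the nodes and times are arbitrary, and the constrained minimization uses projected gradient descent onto the probability simplex (a stand-in chosen for self-containment; any quadratic programming routine could be substituted).

```python
from math import exp

def build_qp(times, data, nodes):
    """Assemble H, f, c of (5.2) with a_jk = y(t_j; theta_k), taking C = identity."""
    a = [[exp(-theta * t) for theta in nodes] for t in times]   # hypothetical model y
    m = len(nodes)
    H = [[2.0 * sum(row[k] * row[l] for row in a) for l in range(m)] for k in range(m)]
    f = [-2.0 * sum(v_j * row[k] for v_j, row in zip(data, a)) for k in range(m)]
    c = sum(v_j ** 2 for v_j in data)
    return a, H, f, c

def quadratic(p, H, f, c):
    """Objective of (5.2): (1/2) p^T H p + f^T p + c."""
    m = len(p)
    quad = sum(p[k] * H[k][l] * p[l] for k in range(m) for l in range(m))
    return 0.5 * quad + sum(f[k] * p[k] for k in range(m)) + c

def least_squares(p, a, data):
    """Direct cost: sum_j (v_j - sum_k a_jk p_k)^2."""
    return sum((v_j - sum(a_jk * p_k for a_jk, p_k in zip(row, p))) ** 2
               for v_j, row in zip(data, a))

def project_to_simplex(x):
    """Euclidean projection onto {p : p_k >= 0, sum_k p_k = 1}."""
    u = sorted(x, reverse=True)
    running, tau = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        running += u_j
        if u_j - (running - 1.0) / j > 0:
            tau = (running - 1.0) / j
    return [max(x_k - tau, 0.0) for x_k in x]

def solve(H, f, steps=5000):
    m = len(f)
    p = [1.0 / m] * m                              # start from uniform weights
    step = 1.0 / sum(H[k][k] for k in range(m))    # trace(H) bounds the largest eigenvalue
    for _ in range(steps):
        grad = [sum(H[k][l] * p[l] for l in range(m)) + f[k] for k in range(m)]
        p = project_to_simplex([p[k] - step * grad[k] for k in range(m)])
    return p

nodes = [0.5, 1.0, 2.0, 4.0]
times = [0.25 * j for j in range(1, 13)]
truth = [0.1, 0.4, 0.3, 0.2]
data = [sum(w * exp(-th * t) for w, th in zip(truth, nodes)) for t in times]
a, H, f, c = build_qp(times, data, nodes)
p_hat = solve(H, f)
print(p_hat, quadratic(p_hat, H, f, c))
```

Since $H$ and $f$ depend only on the data and on the model evaluated at the fixed nodes, they are assembled once and reused in every gradient evaluation, which is the precomputation referred to above.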


6  Extensions  to  Other  Error  Models 


In this final section, we comment on generalizations of the statistical and error models (1.3) and
(1.4). As noted in Section 1, the estimator (1.5) is premised upon an assumption of independent,
identically distributed, constant variance additive error,

$$V_j = v(t_j; P_0) + \mathcal{E}_j,$$

which may be rewritten

$$V = v(t; P_0) + \mathcal{E} \tag{6.1}$$

where by assumption

$$E[\mathcal{E}] = 0, \qquad \mathrm{Var}[\mathcal{E}] = \sigma^2 I_n.$$

While such an assumption is common, many physical and biological problems are not accurately
described by such a simple statistical model. Thankfully, the results presented above can be
easily extended to cover a larger class of error models. Consider the more general error model

$$E[\mathcal{E}] = 0, \qquad \mathrm{Var}[\mathcal{E}] = \sigma^2 W = \sigma^2 \,\mathrm{diag}\left( w(t_1)^2, \ldots, w(t_n)^2 \right), \tag{6.2}$$

where the function $w(t) > 0$ is a continuous weighting function. Such a statistical model arises
from an observation process in which measurement errors are independent but are not identically
distributed. This formulation includes the special case that $w(t) = v(t; P_0)$, which is commonly
called a relative error [24] or constant coefficient of variation (CCV) error model [20, 26]. (Of
course, in such a case, one does not actually know $P_0$ and an iterative estimation procedure must
be used [24, 26].) Now define $L = \mathrm{diag}(w(t_1), \ldots, w(t_n))$. It follows that $LL^T = L^2 = W$ and
$L^{-1}$ exists (since it is assumed $w(t) > 0$ for all $t$). Applying $L^{-1}$ to both sides of (6.1),

$L^{-1}V = L^{-1}v(t; P_0) + L^{-1}\mathcal{E}$
or

$Z = \tilde{v}(t; P_0) + \eta$, (6.3)

where $Z$, $\tilde{v}$, and $\eta$ have the obvious definitions ($Z = L^{-1}V$, $\tilde{v} = L^{-1}v$, and $\eta = L^{-1}\mathcal{E}$).
Moreover, assuming the distributions from which the random errors are drawn are uniquely
determined by their first two statistical moments, the random variables $\eta_j$ are independent and
identically distributed with constant variance. Thus the theory presented in this document can
be applied to the transformed model (6.3).
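
The constant-variance claim for the transformed errors can be checked directly: since $\mathrm{Var}[\mathcal{E}] = \sigma^2 W = \sigma^2 LL^T$, one has $\mathrm{Var}[L^{-1}\mathcal{E}] = L^{-1}(\sigma^2 LL^T)L^{-T} = \sigma^2 I_n$. The following is a minimal numerical sketch of that rescaling, not an implementation taken from this report. The sampling times, the weighting function $w(t) = 1 + t^2$, the value of $\sigma$, and the function names `whiten_diagonal` and `transformed_covariance` are all hypothetical choices made for illustration.

```python
import numpy as np


def whiten_diagonal(observations, model_values, weights):
    """Apply L^{-1} = diag(1 / w(t_j)) to the data and to the model output."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights <= 0):
        raise ValueError("w(t) must be strictly positive so that L is invertible")
    return np.asarray(observations) / weights, np.asarray(model_values) / weights


def transformed_covariance(weights, sigma):
    """Covariance of L^{-1} E when Var[E] = sigma^2 * diag(w(t_j)^2)."""
    weights = np.asarray(weights, dtype=float)
    L = np.diag(weights)
    L_inv = np.diag(1.0 / weights)
    return L_inv @ (sigma**2 * (L @ L.T)) @ L_inv.T


# Hypothetical example values (assumptions, not taken from the report).
t = np.linspace(0.5, 2.5, 5)
w = 1.0 + t**2          # an illustrative positive weighting function
sigma = 0.3

cov_eta = transformed_covariance(w, sigma)
print(np.round(cov_eta, 6))  # diagonal matrix with sigma^2 on the diagonal
```

Because $L$ is diagonal, the transformation amounts to dividing each observation and each model value by $w(t_j)$, so no matrix inversion is needed in practice.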

Additional generalizations are also possible. For instance, the matrix $W$ may depend upon
additional nuisance parameters $\gamma$. In particular, it has been shown that histogram data from a
flow cytometer is well-described by an error model of the form $W = \mathrm{diag}(w(t_1)^{\gamma}, \ldots, w(t_n)^{\gamma})$ for
some scalar $\gamma$ [20, 23, 32]. Such nuisance parameters can be estimated in an iterative procedure
[28] and the theory presented in this document is essentially unchanged. If the observations are
not independent, then the matrix $R$ in (1.4) will not be diagonal. In such a situation, the theory
presented in this report can still be applied provided $R$ is diagonalizable and this diagonalization
is sufficient so that the resulting transformed errors are independent and identically distributed
(such is the case, for instance, for autoregressive errors of order $r < n$).
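
As a rough illustration of the correlated case, the sketch below builds the covariance matrix of a stationary first-order autoregressive error process and diagonalizes it to obtain a transformation under which the errors are uncorrelated with unit variance. It is a sketch under stated assumptions and not the procedure used in the cited work: the AR(1) form of $R$, the values of `rho` and `sigma`, and the helper names `ar1_covariance` and `whitening_matrix` are hypothetical. Note also that decorrelation implies independence only under additional distributional assumptions (for example, Gaussian errors).

```python
import numpy as np


def ar1_covariance(n, rho, sigma):
    """Stationary AR(1) covariance: sigma^2 * rho^|i-j| / (1 - rho^2)."""
    idx = np.arange(n)
    lags = np.abs(idx[:, None] - idx[None, :])
    return sigma**2 * rho**lags / (1.0 - rho**2)


def whitening_matrix(R):
    """Return T with T @ R @ T.T = I, using R = Q diag(lam) Q^T."""
    lam, Q = np.linalg.eigh(R)  # R is symmetric, so eigh applies
    if lam.min() <= 0:
        raise ValueError("R must be positive definite")
    return np.diag(lam ** -0.5) @ Q.T


# Hypothetical example values (assumptions, not taken from the report).
R = ar1_covariance(n=6, rho=0.6, sigma=0.5)
T = whitening_matrix(R)
print(np.round(T @ R @ T.T, 6))  # numerically the identity matrix
```

Applying `T` to both the observation vector and the model output plays the same role here that $L^{-1}$ plays for the diagonal weighting in (6.2).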

7  Concluding  Remarks 

In this document we have defined a parameter estimation problem in which one has a mathematical
model describing the dynamics of an individual biological or physical process but data
which is sampled from a population of individuals. Because each individual is assumed to be
described by a unique set of parameters, the data is described not by a single parameter but by the
probability distribution (over all individuals) from which these individual parameters are sampled.
Theoretic results for the nonparametric measure estimation problem are presented which
establish the existence and consistency of the estimator. A previously proposed and numerically
tested computational scheme is also discussed and its convergence is proven.

Several open problems remain. First, while the computational scheme is simple, it is not
always clear how one should go about choosing the $M$ nodes $\theta_k$ from the dense subset of $\Theta$ which
are then used to estimate weights $p_k$. From a theoretical perspective, the nodes need only to be
added so that they ‘fill up’ the parameter space in an appropriate way. In practice, however,
rounding error and ill-conditioning can be quite problematic, particularly for a poor choice of
nodes. A more complete computational algorithm would include information on how to optimally
choose the $M$ nodes $\theta_k$ (as well as the appropriate values of $M$).
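
The ill-conditioning mentioned above can be seen in a small numerical experiment. The sketch below is purely illustrative and rests on several assumptions that do not come from this report: the individual model $v(t; \theta) = e^{-\theta t}$, the sampling times, the equally spaced nodes on $[0.1, 2.0]$, and the choice of $H = A^T A$ with $A_{jk} = v(t_j; \theta_k)$ as a stand-in for the matrix of the discretized least squares problem. It reports how the condition number of this matrix behaves as the number of nodes $M$ increases.

```python
import numpy as np


def model(t, theta):
    """Hypothetical individual model v(t; theta) = exp(-theta * t)."""
    return np.exp(-theta * t)


def gram_condition_number(t, nodes):
    """Condition number of H = A^T A, where A[j, k] = v(t_j; theta_k)."""
    A = model(t[:, None], nodes[None, :])
    return np.linalg.cond(A.T @ A)


# Hypothetical sampling times and node counts (assumptions for illustration).
t = np.linspace(0.0, 5.0, 50)
node_counts = [3, 6, 12]
conds = [gram_condition_number(t, np.linspace(0.1, 2.0, M)) for M in node_counts]

for M, c in zip(node_counts, conds):
    print(f"M = {M:2d}: cond(H) = {c:.2e}")
```

In this toy setting, neighbouring nodes produce nearly collinear columns of `A`, which is one way a poor choice of nodes can make the weights $p_k$ hard to determine numerically.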

Additionally, given the consistency of the estimator $P_n$, it would be desirable to place some
measure of confidence on the estimated probability distribution. The traditional frequentist
approach relies on either asymptotic theory or bootstrapping to construct such measures of
confidence. In the former case, it is not clear how one might extend notions of sensitivity to the
space of probability measures, which would require a notion of differentiability on the space of
probability measures. In the latter case, the results provide some computational estimates but
a rigorous theory is not yet available. Some preliminary work on these topics has been initiated
[12, 14] and is still ongoing.